+

404: Page not found

-

Sorry, we've misplaced that URL or it's pointing to something that doesn't exist. Head back home to try finding it again.

+

Sorry, we've misplaced that URL or it's pointing to something that doesn't exist. Head back home to try finding it again.

+ +

- {{ content }} -

+

Contributing

+

Contributions of new or missing publications are very welcome. Alternative categorization/taxonomies can also be added to the website. To contribute, please open a pull request, but first please read the instructions below.

+ +

Adding a publication

+

To add a publication (new or missing), create a file in the _publications folder. The name of the file should follow the structure lastnameYEARfirstword.markdown where lastname is the last name of the first author and firstword is the first non-punctuation word of the work’s title. Within each file, follow the structure shown in the other files. Once the file is added, the work will appear in the “All Papers” section.

+ +

---
+layout: publication
+title: The title of the Publication
+authors: F. M. LastName, F. M. LastName, ...
+conference: AbbreviatedNameOfConference  # Or journal: AbbreviatedNameOfJournal
+year: YEAR
+additional_links:
+  - {name: "ArXiV", url: "http://arxiv.org/abs/XXXX.YYYY"}
+  - {name: "website", url: "http://paperwebsite.com"}
+  - {name: "code", url: "https://github.com/path-to/code"}
+tags: ["tag1", "tag2"]
+---
+Text of abstract goes here.
+

+ +

The additional_links are optional and arbitrary and they will appear on the page referring to this work. Feel free to add as many additional links as needed.

+ +

Adding a new categorization

+

No single taxonomy or categorization can fit everyone. It is easy to contribute a new categorization to be shown in this website. First, create a data file, similar to those in the _data file describing your taxonomy. +This can be a JSON, YAML or CSV file as described here. +Then, create a folder and a page (or pages) that describe your taxonomy. Finally, submit a pull +request to get this merged into the website.

+ +

Reusing the website structure

+

In principle, the structure of this website can be used for other literature reviews. Feel free to clone it!

+ + +

+ +

+

Machine Learning on Source Code

+ +

The billions of lines of source code that have been written contain +implicit knowledge about how to write good code, code that is +easy to read and to debug. +A recent line of research aims to find statistical patterns in large +corpora of code to drive new software development tools and program +analyses.

+ +

This website and the accompanying article surveys the work in this emerging area.

+ +

Like writing and speaking, software development is an act of human communication. +At its core, +the naturalness of software employs statistical modeling over big code to +reason about rich variety of programs developers write. This new line of +research is inherently interdisciplinary, uniting the machine learning and +natural language processing communities with software engineering +and programming language communities.

+ +

🏷 Browse Papers by Tag

+ +adversarial +API +autocomplete +benchmark +benchmarking +bimodal +Binary Code +clone +code completion +code generation +code similarity +compilation +completion +cybersecurity +dataset +decompilation +defect +deobfuscation +documentation +dynamic +edit +editing +education +evaluation +execution +feature location +fuzzing +generalizability +generation +GNN +grammar +human evaluation +information extraction +instruction tuning +interpretability +language model +large language models +LLM +logging +memorization +metrics +migration +naming +natural language generation +natural language processing +notebook +optimization +pattern mining +plagiarism detection +pretraining +program analysis +program synthesis +question answering +refactoring +repair +representation +retrieval +Reverse Engineering +review +search +static +static analysis +style +summarization +survey +synthesis +test generation +tool +topic modeling +topic modelling +traceability +Transformer +Transformers +translation +types +variable misuse +verification +vulnerability + +

About This Site

+ +

This site is an experiment: a living literature review that allows +you explore, search and navigate the literature in this area. +The full survey is available as a research paper. +Please cite as

+

+@article{allamanis2018survey,
+  title={A survey of machine learning for big code and naturalness},
+  author={Allamanis, Miltiadis and Barr, Earl T and Devanbu, Premkumar and Sutton, Charles},
+  journal={ACM Computing Surveys (CSUR)},
+  volume={51},
+  number={4},
+  pages={81},
+  year={2018},
+  publisher={ACM}
+}
+

+ +

Contributing

+ +

This research area is evolving so fast that a static review cannot keep up. +But a website can! We hope to make this site a living document. +Anyone can add a paper to this web site, essentially by creating one Markdown file. + To contribute, open a pull request in GitHub, by following these instructions +for contributing.

+ +

Contributors

+ +

The core survey and the original taxonomy was created by

+ +

Miltos Allamanis Microsoft Research, Cambridge, UK
Earl T. Barr University College London, London, UK
Prem Devanbu University of California, Davis, USA
Charles Sutton University of Edinburgh and The Alan Turing Institute, UK

+ +

Contributors to the website

+

This website accepts external contributions. +Please, feel free to add your name below, once you contribute to this +website. A comprehensive list can be found here.

+ +

Uri Alon Technion, Israel
Shaked Brody Technion, Israel
Nghi D. Q. Bui Singapore Management University, Singapore
Rajaswa Patil Microsoft PROSE

+ +

+ Search across all paper titles, abstracts, authors by using the search field. Please consider contributing by updating the information of existing papers or adding new work. - - -{% assign publicationsByYear = site.publications | sort: "year" | group_by: "year" %} -{% for year in publicationsByYear reversed %} -{% for publication in year.items %} +

Year	Title	Authors	Venue	Abstract

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + - - + - - - - + + + + -{% endfor %} -{% endfor %} + +

Year	Title	Authors	Venue	Abstract
2024	LLM4Decompile: Decompiling Binary Code with Large Language Models + + + + +	Hanzhuo Tan, Qi Luo, Jing Li, Yuqun Zhang		Decompilation aims to restore compiled code to human-readable source code, but struggles with details like names and structure. Large language models (LLMs) show promise for programming tasks, motivating their application to decompilation. However, there does not exist any open-source LLM for decompilation. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs ranging from 1B to 33B pre-trained on 4 billion tokens of C source code and the corresponding assembly code. The open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating the decompilation model from the perspective of program semantics. Experiments indicate that our LLM4Decompile has demonstrated the capability to accurately decompile 21% of the assembly code, which achieves a 50% improvement over GPT-4. Our code, dataset, and models are released at this https URL +	decompilation translation evaluation large language models LLM
2024	DebugBench: Evaluating Debugging Capability of Large Language Models + + + + +	Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Zhiyuan Liu, Maosong Sun		Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs’ debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce `DebugBench’, an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and three open-source models in a zero-shot scenario. We find that (1) while closed-source models like GPT-4 exhibit inferior debugging performance compared to humans, open-source models such as Code Llama fail to attain any pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging. +	repair
2024	Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search + + + + +	Haochen Li, Xin Zhou, Zhiqi Shen		In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly with the demonstrated code generation capabilities of Large Language Models (LLMs). Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation could potentially be ascribed to the fact that the generated codes, albeit functionally accurate, frequently display a pronounced stylistic deviation from the ground truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances. +	search large language models metrics
2024	DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence + + + + +	Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang		The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use. +	Transformers
2024	T5APR: Empowering Automated Program Repair across Languages through Checkpoint Ensemble + + + + +	Reza Gharibi, Mohammad Hadi Sadreddini, Seyed Mostafa Fakhrahmad		Automated program repair (APR) using deep learning techniques has become an important area of research in recent years, aiming to automatically generate bug-fixing patches that can improve software reliability and maintainability. However, most existing methods either target a single language or require high computational resources to train multilingual models. In this paper, we propose T5APR, a novel neural program repair approach that provides a unified solution for bug fixing across multiple programming languages. T5APR leverages CodeT5, a powerful pre-trained text-to-text transformer model, and adopts a checkpoint ensemble strategy to improve patch recommendation. We conduct comprehensive evaluations on six well-known benchmarks in four programming languages (Java, Python, C, JavaScript), demonstrating T5APR’s competitiveness against state-of-the-art techniques. T5APR correctly fixes 1,985 bugs, including 1,442 bugs that none of the compared techniques has fixed. We further support the effectiveness of our approach by conducting detailed analyses, such as comparing the correct patch ranking among different techniques. The findings of this study demonstrate the potential of T5APR for use in real-world applications and highlight the importance of multilingual approaches in the field of APR. +	repair Transformer
2024	PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models + + + + +	Simin Chen, Xiaoning Feng, Xiaohong Han, Cong Liu, Wei Yang	FSE 2024	In recent times, a plethora of Large Code Generation Models (LCGMs) have been proposed, showcasing significant potential in assisting developers with complex programming tasks. Benchmarking LCGMs necessitates the creation of a set of diverse programming problems, and each problem comprises the prompt (including the task description), canonical solution, and test inputs. The existing methods for constructing such a problem set can be categorized into two main types: manual methods and perturbation-based methods. However, manual methods demand high effort and lack scalability, while also risking data integrity due to LCGMs’ potentially contaminated data collection, and perturbation-based approaches mainly generate semantically homogeneous problems with the same canonical solutions and introduce typos that can be easily auto-corrected by IDE, making them ineffective and unrealistic. In this work, we propose the idea of programming problem merging (PPM) and provide two implementation of this idea, we utilize our tool on two widely-used datasets and compare it against nine baseline methods using eight code generation models. The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems, comparing to the baselines. +	benchmarking evaluation
2024	A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks + + + + +	Beatrice Casey, Joanna C. S. Santos, George Perry		Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what’s not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall. +	survey cybersecurity vulnerability
2024	Can Large Language Model Detect Plagiarism in Source Code? + + + + +	William Brach, Kristián Košťál, Michal Ries	FLLM	The issue of code plagiarism represents a significant challenge in the academic environment. This study examines the potential of large language models (LLMs) in improving the detection of code plagiarism. The performance of several LLMs, including GPT-4o, GPT3.5 Turbo, LLaMA 3, and CodeLlama, is evaluated in comparison to conventional tools, such as JPlag, across a range of levels of code plagiarism. The findings of our study illustrate that state-of-the-art LLMs are able to outperform traditional methods, particularly in the detection of sophisticated forms of plagiarism. GPT-4o exhibited the highest overall accuracy (78.70%) and an F1 score of 86.97%. It is important to note that open-source models, such as LLaMA 3 (accuracy 71.53%, F1 score 82.75%), demonstrated the ability to detect the most complex forms of plagiarism with the same accuracy as GPT-4o. While these results demonstrate the promising potential of LLMs in code similarity analysis, it is also evident that higher false positive rates may be an inherent limitation, emphasizing the need for human oversight. This study contributes valuable insights into the application of AI in maintaining code integrity and academic honesty, paving the way for more effective, interpretable, and fair plagiarism detection systems in software development education and practice. +	code similarity large language models LLM plagiarism detection natural language processing
2024	RepairAgent: An Autonomous, LLM-Based Agent for Program Repair + + + + +	Islem Bouzenia, Premkumar Devanbu, Michael Pradel		Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent’s effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI’s GPT-3.5 model, translates to 14 cents of USD per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering. +	repair
2024	DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models + + + + +	Berkay Berabi, Alexey Gronskiy, Veselin Raychev, Gishor Sivanrupan, Victor Chibotaru, Martin Vechev		The automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction to solve this challenge is by leveraging large language models (LLMs), which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving code-repair task. We show that the task is difficult as it requires the model to learn long-range code relationships, a task that inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset for complex program bugs and their corresponding fixes is non-trivial. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs. The idea is to use program analysis to limit the LLM’s attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context - and use that instead. Our evaluation shows that this code reduction approach substantially improves available models such as GPT-4 using few-shot learning, as well as fine-tuning models. To train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), requiring complex interprocedural dataflow to discover. Our best system with Mixtral-8x7B can remove more than 80% of the reported defects while exactly matching the human fix in between 10 and 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4, or based on window-based models like TFix. +	repair vulnerability
2024	Studying LLM Performance on Closed- and Open-source Data + + + + +	Toufique Ahmed, Christian Bird, Premkumar Devanbu, Saikat Chakraborty		Large Language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry, and are largely trained on open-source (OSS) code distributed with permissive licenses. In terms of actual use however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers, do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there work-arounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS –> proprietary code, but does significantly reduce for C++; we find that this difference is attributable to differences in identifiers. We also find that some performance degradation, in some cases, can be ameliorated efficiently by in-context learning. +	Transformers
2024	Predictive Program Slicing via Execution Knowledge-Guided Dynamic Dependence Learning + + + + +	Aashish Yadavally, Yi Li, Tien N. Nguyen	FSE	Program slicing, the process of extracting program statements that influence values at a designated location (known as the slicing criterion), is helpful in both manual and automated debugging. However, such slicing techniques prove ineffective in scenarios where executing specific inputs is prohibitively expensive, or even impossible, as with partial code. In this paper, we introduce ND-Slicer, a predictive slicing methodology that caters to specific executions based on a particular input, overcoming the need for actual execution. We enable such a process by leveraging execution-aware pre-training to learn the dynamic program dependencies, including both dynamic data and control dependencies between variables in the slicing criterion and the remaining program statements. Such knowledge forms the cornerstone for constructing a predictive backward slice. Our empirical evaluation revealed a high accuracy in predicting program slices, achieving an exact-match accuracy of 81.3% and a ROUGE-LCS F1-score of 95.4% on Python programs. As an extrinsic evaluation, we illustrate ND-Slicer’s usefulness in crash detection, with it locating faults with an accuracy of 63.9%. Furthermore, we include an in-depth qualitative evaluation, assessing ND-Slicer’s understanding of branched structures such as if-else blocks and loops, as well as the control flow in inter-procedural calls. +	large language models program analysis dynamic tool
2024	A Learning-Based Approach to Static Program Slicing + + + + +	Aashish Yadavally, Yi Li, Shaohua Wang, Tien N. Nguyen	OOPSLA	Traditional program slicing techniques are crucial for early bug detection and manual/automated debugging of online code snippets. Nevertheless, their inability to handle incomplete code hinders their real-world applicability in such scenarios. To overcome these challenges, we present NS-Slicer, a novel learning-based approach that predicts static program slices for both complete and partial code. Our tool leverages a pre-trained language model to exploit its understanding of fine-grained variable-statement dependencies within source code. With this knowledge, given a variable at a specific location and a statement in a code snippet, NS-Slicer determines whether the statement belongs to the backward slice or forward slice, respectively. We conducted a series of experiments to evaluate NS-Slicer’s performance. On complete code, it predicts the backward and forward slices with an F1-score of 97.41% and 95.82%, respectively, while achieving an overall F1-score of 96.77%. Notably, in 85.20% of the cases, the static program slices predicted by NS-Slicer exactly match entire slices from the oracle. For partial programs, it achieved an F1-score of 96.77%–97.49% for backward slicing, 92.14%–95.40% for forward slicing, and an overall F1-score of 94.66%–96.62%. Furthermore, we demonstrate NS-Slicer’s utility in vulnerability detection (VD), integrating its predicted slices into an automated VD tool. In this setup, the tool detected vulnerabilities in Java code with a high F1-score of 73.38%. We also include the analyses studying NS-Slicer’s promising performance and limitations, providing insights into its understanding of intrinsic code properties such as variable aliasing, leading to better slicing. +	large language models program analysis static tool
2023	CodeT5+: Open Code Large Language Models for Code Understanding and Generation + + + + +	Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, Steven C. H. Hoi		Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+’’, a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs. +	Transformer
2023	DeepVD: Toward Class-Separation Features for Neural Network Vulnerability Detection + + + + +	Wenbo Wang, Tien N. Nguyen, Shaohua Wang, Yi Li, Jiyuan Zhang, Aashish Yadavally	ICSE	The advances of machine learning (ML) including deep learning (DL) have enabled several approaches to implicitly learn vulnerable code patterns to automatically detect software vulnerabilities. A recent study showed that despite successes, the existing ML/DL-based vulnerability detection (VD) models are limited in the ability to distinguish between the two classes of vulnerability and benign code. We propose DeepVD, a graph-based neural network VD model that emphasizes on class-separation features between vulnerability and benign code. DeepVD leverages three types of class-separation features at different levels of abstraction: statement types (similar to Part-of-Speech tagging), Post-Dominator Tree (covering regular flows of execution), and Exception Flow Graph (covering the exception and error-handling flows). We conducted several experiments to evaluate DeepVD in a real-world vulnerability dataset of 303 projects with 13,130 vulnerable methods. Our results show that DeepVD relatively improves over the state-of-the-art ML/DL-based VD approaches 13%–29.6% in precision, 15.6%–28.9% in recall, and 16.4%–25.8% in F-score. Our ablation study confirms that our designed features and components help DeepVD achieve high class-separability for vulnerability and benign code. +	vulnerability
2023	LExecutor: Learning-Guided Execution + + + + +	Beatriz Souza, Michael Pradel		Executing code is essential for various program analysis tasks, e.g., to detect bugs that manifest through exceptions or to obtain execution traces for further dynamic analysis. However, executing an arbitrary piece of code is often difficult in practice, e.g., because of missing variable definitions, missing user inputs, and missing third-party dependencies. This paper presents LExecutor, a learning-guided approach for executing arbitrary code snippets in an underconstrained way. The key idea is to let a neural model predict missing values that otherwise would cause the program to get stuck, and to inject these values into the execution. For example, LExecutor injects likely values for otherwise undefined variables and likely return values of calls to otherwise missing functions. We evaluate the approach on Python code from popular open-source projects and on code snippets extracted from Stack Overflow. The neural model predicts realistic values with an accuracy between 80.1% and 94.2%, allowing LExecutor to closely mimic real executions. As a result, the approach successfully executes significantly more code than any available technique, such as simply executing the code as-is. For example, executing the open-source code snippets as-is covers only 4.1% of all lines, because the code crashes early on, whereas LExecutor achieves a coverage of 50.1%. + +	execution
2023	RepoFusion: Training Code Models to Understand Your Repository + + + + +	Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak		Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models as CodeGen-16B-multi ($\sim73\times$ larger) and closely match the performance of the $\sim 70\times$ larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at \url{https://huggingface.co/RepoFusion}. +	completion
2023	RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair + + + + +	André Silva, Sen Fang, Martin Monperrus		Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tunes LLMs with naive code representations and is fundamentally limited in its ability to fine-tune larger LLMs. To address this problem, we propose RepairLLaMA, a novel program repair approach that combines 1) code representations for APR and 2) the state-of-the-art parameter-efficient LLM fine-tuning technique called LoRA. This results in RepairLLaMA producing a highly effective `program repair adapter’ for fixing bugs with language models. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals. Second, parameter-efficient fine-tuning helps fine-tuning to converge and contributes to the effectiveness of the repair adapter to fix data-points outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 125 Defects4J v2 and 82 HumanEval-Java bugs, outperforming all baselines. +	repair
2023	Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models + + + + +	Iman Saberi, Fateme H. Fard	MSR	Pre-trained Programming Language Models (PPLMs) achieved many recent states of the art results for many code-related software engineering tasks. Though some studies use data flow or propose tree-based models that utilize Abstract Syntax Tree (AST), most PPLMs do not fully utilize the rich syntactical information in source code. Still, the input is considered a sequence of tokens. There are two issues; the first is computational inefficiency due to the quadratic relationship between input length and attention complexity. Second, any syntactical information, when needed as an extra input to the current PPLMs, requires the model to be pre-trained from scratch, wasting all the computational resources already used for pre-training the current models. In this work, we propose Named Entity Recognition (NER) adapters, lightweight modules that can be inserted into Transformer blocks to learn type information extracted from the AST. These adapters can be used with current PPLMs such as CodeBERT, GraphCodeBERT, and CodeT5. We train the NER adapters using a novel Token Type Classification objective function (TTC). We insert our proposed work in CodeBERT, building CodeBERTER, and evaluate the performance on two tasks of code refinement and code summarization. CodeBERTER improves the accuracy of code refinement from 16.4 to 17.8 while using 20% of training parameter budget compared to the fully fine-tuning approach, and the BLEU score of code summarization from 14.75 to 15.90 while reducing 77% of training parameters compared to the fully fine-tuning approach. +	Transformer repair summarization
2023	Generative Type Inference for Python + + + + +	Yun Peng, Chaozheng Wang, Wenxuan Wang, Cuiyun Gao, Michael R. Lyu		Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. However, its dynamic type system can lead to potential type errors, leading researchers to explore automatic type inference approaches for Python programs. The rule-based type inference approaches can ensure the accuracy of predicted variable types, but they suffer from low coverage problems. Supervised type inference approaches, while feature-agnostic, require large, high-quality annotated datasets and are limited to pre-defined types. As zero-shot approaches, the cloze-style approaches reformulate the type inference problem into a fill-in-the-blank problem. However, their performance is limited. This paper introduces TypeGen, a few-shot generative type inference approach that incorporates static domain knowledge from static analysis. TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis into prompts based on the type dependency graphs (TDGs), enabling language models to learn from how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations. TypeGen only requires very few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of results through the use of the input-explanation-output strategy. Experiments show that TypeGen outperforms the best baseline Type4Py by 10.0% for argument type prediction and 22.5% in return value type prediction in terms of top-1 Exact Match by using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% compared to the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B in terms of top-1 Exact Match. +	types
2023	Demystifying GPT Self-Repair for Code Generation + + + + +	Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama		Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair – in which the model debugs and fixes mistakes in its own code – has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4’s ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains. +	repair
2023	CodeGen2: Lessons for Training LLMs on Programming and Natural Languages + + + + +	Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, Yingbo Zhou		Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly. + + In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and, (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a “free lunch” hypothesis. For data distributions, the effect of a mixture distribution of programming and natural languages on model performance is explored. + + We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into four lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and, 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen2 +	Transformer
2023	OctoPack: Instruction Tuning Code Large Language Models + + + + +	Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre		Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack’s benefits in generalizing to a wider set of languages and natural coding tasks. Code, models and data are freely available at https://github.com/bigcode-project/octopack. +	dataset instruction tuning
2023	SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models + + + + +	Mohammad Mahdi Mohajer, Reem Aleithan, Nima Shiri Harzevili, Moshi Wei, Alvine Boaye Belle, Hung Viet Pham, Song Wang		We introduce SkipAnalyzer, a large language model (LLM)-powered tool for static code analysis. SkipAnalyzer has three components: 1) an LLM-based static bug detector that scans source code and reports specific types of bugs, 2) an LLM-based false-positive filter that can identify false-positive bugs in the results of static bug detectors (e.g., the result of step 1) to improve detection accuracy, and 3) an LLM-based patch generator that can generate patches for the detected bugs above. As a proof-of-concept, SkipAnalyzer is built on ChatGPT, which has exhibited outstanding performance in various software engineering tasks. To evaluate SkipAnalyzer, we focus on two types of typical and critical bugs that are targeted by static bug detection, i.e., Null Dereference and Resource Leak as subjects. We employ Infer to aid the gathering of these two bug types from 10 open-source projects. Consequently, our experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that SkipAnalyzer achieves remarkable performance in the mentioned static analysis tasks, including bug detection, false-positive warning removal, and bug repair. In static bug detection, SkipAnalyzer achieves accuracy values of up to 68.37% for detecting Null Dereference bugs and 76.95% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, SkipAnalyzer can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs. Additionally, SkipAnalyzer surpasses state-of-the-art false-positive warning removal tools. Furthermore, in bug repair, SkipAnalyzer can generate syntactically correct patches to fix its detected bugs with a success rate of up to 97.30%. +	repair
2023	Code Execution with Pre-trained Language Models + + + + +	Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan, Nan Duan		Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code search and text-to-code generation. Our analysis provides insights into the learning and generalization abilities of pre-trained models for code execution. +	Transformer execution
2023	Fine-Tuning Large Language Models for Answering Programming Questions with Code Snippets + + + + +	V. Lomshakov, S. Kovalchuk, M. Omelchenko, S. Nikolenko, A. Aliev	ICCS	We study the ability of pretrained large language models (LLM) to answer questions from online question answering fora such as Stack Overflow. We consider question-answer pairs where the main part of the answer consists of source code. On two benchmark datasets — CoNaLa and a newly collected dataset based on Stack Overflow — we investigate how a closed-book question answering system can be improved by fine-tuning the LLM for the downstream task, prompt engineering, and data preprocessing. We use publicly available autoregressive language models such as GPT-Neo, CodeGen, and PanGu-Coder, and after the proposed fine-tuning achieve a BLEU score of 0.4432 on the CoNaLa test set, significantly exceeding previous state of the art for this task. +	program synthesis question answering large language models
2023	StarCoder: may the source be with you! + + + + +	Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries		The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI `code-cushman-001` model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license. +	Transformer
2023	Rethinking Negative Pairs in Code Search + + + + +	Haochen Li, Xin Zhou, Luu Anh Tuan, Chunyan Miao	EMNLP	Recently, contrastive learning has become a key component in fine-tuning code search models for software development efficiency and effectiveness. It pulls together positive code snippets while pushing negative samples away given search queries. Among contrastive learning, InfoNCE is the most widely used loss function due to its better performance. However, the following problems in negative samples of InfoNCE may deteriorate its representation learning: 1) The existence of false negative samples in large code corpora due to duplications. 2). The failure to explicitly differentiate between the potential relevance of negative samples. As an example, a bubble sorting algorithm example is less ``negative’’ than a file saving function for the quick sorting algorithm query. In this paper, we tackle the above problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss function, we apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation. We furthermore discuss the superiority of proposed loss functions with other design alternatives. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and weights estimation methods under state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. +	search Transformer retrieval optimization representation
2023	Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation + + + + +	Xin-Ye Li, Jiang-Tian Xue, Zheng Xie, Ming Li		Code generation aims to automatically generate source code from high-level task specifications, which can significantly increase productivity of software engineering. Recently, approaches based on large language models (LLMs) have shown remarkable code generation abilities on simple tasks. However, generate code for more complex tasks, such as competition-level problems, remains challenging. In this paper, we introduce Brainstorm framework for code generation. It leverages a brainstorming step that generates and selects diverse thoughts on the problem to facilitate algorithmic reasoning, where the thoughts are possible blueprint of solving the problem. We demonstrate that Brainstorm significantly enhances the ability of LLMs to solve competition-level programming problems, resulting in a more than 50% increase in the pass@$k$ metrics for ChatGPT on the CodeContests benchmark, achieving state-of-the-art performance. Furthermore, our experiments conducted on LeetCode contests show that our framework boosts the ability of ChatGPT to a level comparable to that of human programmers. +	generation Transformer
2023	The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models + + + + +	Haonan Li, Yu Hao, Yizhuo Zhai, Zhiyun Qian		Static analysis is a widely used technique in software engineering for identifying and mitigating bugs. However, a significant hurdle lies in achieving a delicate balance between precision and scalability. Large Language Models (LLMs) offer a promising alternative, as recent advances demonstrate remarkable capabilities in comprehending, generating, and even debugging code. Yet, the logic of bugs can be complex and require sophisticated reasoning and a large analysis scope spanning multiple functions. Therefore, at this point, LLMs are better used in an assistive role to complement static analysis. In this paper, we take a deep dive into the open space of LLM-assisted static analysis, using use-before-initialization (UBI) bugs as a case study. To this end, we develop LLift, a fully automated agent that interfaces with both a static analysis tool and an LLM. By carefully designing the agent and the prompts, we are able to overcome a number of challenges, including bug-specific modeling, the large problem scope, the non-deterministic nature of LLMs, etc. Tested in a real-world scenario analyzing nearly a thousand potential UBI bugs produced by static analysis, LLift demonstrates an extremely potent capability, showcasing a high precision (50%) and recall rate (100%). It even identified 13 previously unknown UBI bugs in the Linux kernel. This research paves the way for new opportunities and methodologies in the use of LLMs for bug discovery in extensive, real-world datasets. +	static analysis
2023	Test-based and metric-based evaluation of code generation models for practical question answering + + + + +	S. Kovalchuk, D. Fedrushkov, V. Lomshakov, A. Aliev	ICCQ	We performed a comparative analysis of code generation model performance with evaluation using common NLP metrics in comparison to a test-based evaluation. The investigation was performed in the context of question answering with code (test-to-code problem) and was aimed at applicability checking both ways for generated code evaluation in a fully automatic manner. We used CodeGen and GPTNeo pretrained models applied to a problem of question answering using Stack Overflow-based corpus (APIzation). For test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used for providing automatically generated tests. The analysis showed that the performance evaluation based solely on NLP metrics or on tests provides a rather limited assessment of generated code quality. We see the evidence that predictions with both high and low NLP metrics exist that pass and don’t pass tests. With the early results of our empirical study being discussed in this paper, we believe that the combination of both approaches may increase possible ways for building, evaluating, and training code generation models. +	code generation test generation natural language generation evaluation metrics natural language processing
2023	Large Language Models and Simple, Stupid Bugs + + + + +	Kevin Jesse, Toufique Ahmed, Premkumar T. Devanbu, Emily Morgan		With the advent of powerful neural language models, AI-based systems to assist developers in coding tasks are becoming widely available; Copilot is one such system. Copilot uses Codex, a large language model (LLM), to complete code conditioned on a preceding “prompt”. Codex, however, is trained on public GitHub repositories, viz., on code that may include bugs and vulnerabilities. Previous studies [1], [2] show Codex reproduces vulnerabilities seen in training. In this study, we examine how prone Codex is to generate an interesting bug category, single statement bugs, commonly referred to as simple, stupid bugs or SStuBs in the MSR community. We find that Codex and similar LLMs do help avoid some SStuBs, but do produce known, verbatim SStuBs as much as 2x as likely than known, verbatim correct code. We explore the consequences of the Codex generated SStuBs and propose avoidance strategies that suggest the possibility of reducing the production of known, verbatim SStubs, and increase the possibility of producing known, verbatim fixes. +	Transformer defect
2023	RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation + + + + +	Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, Weizhu Chen		The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model, which allows for the effective utilization of repository-level information for code completion and grants the ability to generate code at various levels of granularity. Furthermore, RepoCoder utilizes a novel iterative retrieval-generation paradigm that bridges the gap between retrieval context and the intended completion target. We also propose a new benchmark RepoEval, which consists of the latest and high-quality real-world repositories covering line, API invocation, and function body completion scenarios. We test the performance of RepoCoder by using various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research. +	completion Transformer retrieval
2023	Grace: Language Models Meet Code Edits + + + + +	Priyanshu Gupta, Avishree Khare, Yasharth Bajpai, Saikat Chakraborty, Sumit Gulwani, Aditya Kanade, Arjun Radhakrishna, Gustavo Soares, Ashish Tiwari	FSE	Developers spend a significant amount of time in editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) with the knowledge of relevant prior associated edits, which we call the Grace (Generation conditioned on Associated Code Edits) method. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, Grace boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively. +	editing
2023	Automatically Testing Functional Properties of Code Translation Models + + + + +	Hasan Ferit Eniser, Valentin Wüstholz, Maria Christakis	AAAI	Large language models are becoming increasingly practical for translating code across programming languages, a process known as $transpiling$. Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations. +	translation
2023	CodeScore: Evaluating Code Generation by Learning Code Execution + + + + +	Yihong Dong, Jiazheng Ding, Xue Jiang, Zhuo Li, Ge Li, Zhi Jin		A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing CEMs can be categorized into match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) and execution-based CEMs (e.g., AvgPassRatio and Pass@k), but both of them suffer from some issues. The former only measures differences in surface form regardless of the functional equivalence of codes, while the latter has huge execution overheads, including collecting expensive test cases, resolving tedious execution dependencies, and enormous execution time. To address these issues, in this paper, we propose CodeScore, an efficient and effective CEM for code generation, which estimates test case PassRatio of generated code without executing code. We also present a framework named UniCE for training unified code evaluation models by learning code execution, i.e., learning PassRatio and Executability of generated code. In order to learn code execution comprehensively, we construct more than 100 test cases for each task in several popular benchmark datasets, covering MBPP, APPS, and HumanEval. Experimental results show that CodeScore has obtained a state-of-the-art correlation with execution-based CEMs. CodeScore is strongly correlated with AvgPassPatio, and binary CodeScore is moderately correlated with Pass@1. In particular, CodeScore eliminates the need for test cases and execution dependencies in inference, and CodeScore reduces execution time by three orders of magnitude compared to AvgPassPatio and Pass@1. +	Transformer evaluation
2023	A Static Evaluation of Code Completion by Large Language Models + + + + +	Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang		Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven’t been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions. +	LLM static analysis
2023	Beware of the Unexpected: Bimodal Taint Analysis + + + + +	Yiu Wai Chow, Max Schäfer, Michael Pradel	ISSTA	Static analysis is a powerful tool for detecting security vulnerabilities and other programming problems. Global taint tracking, in particular, can spot vulnerabilities arising from complicated data flow across multiple functions. However, precisely identifying which flows are problematic is challenging, and sometimes depends on factors beyond the reach of pure program analysis, such as conventions and informal knowledge. For example, learning that a parameter `name` of an API function `locale` ends up in a file path is surprising and potentially problematic. In contrast, it would be completely unsurprising to find that a parameter `command` passed to an API function `execaCommand` is eventually interpreted as part of an operating-system command. This paper presents Fluffy, a bimodal taint analysis that combines static analysis, which reasons about data flow, with machine learning, which probabilistically determines which flows are potentially problematic. The key idea is to let machine learning models predict from natural language information involved in a taint flow, such as API names, whether the flow is expected or unexpected, and to inform developers only about the latter. We present a general framework and instantiate it with four learned models, which offer different trade-offs between the need to annotate training data and the accuracy of predictions. We implement Fluffy on top of the CodeQL analysis framework and apply it to 250K JavaScript projects. Evaluating on five common vulnerability types, we find that Fluffy achieves an F1 score of 0.85 or more on four of them across a variety of datasets. +	static analysis
2023	Supersonic: Learning to Generate Source Code Optimizations in C/C++ + + + + +	Zimin Chen, Sen Fang, Martin Monperrus		Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic’s performance is benchmarked against OpenAI’s GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4. +	optimization
2023	DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection + + + + +	Yizheng Chen, Zhoujie Ding, Xinyun Chen, David Wagner		We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits. Our dataset covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection. +Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. +However, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future for vulnerability detection, outperforming Graph Neural Networks (GNNs) with manual feature engineering. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance. +	dataset Transformer vulnerability
2023	CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code + + + + +	Shuyan Zhou, Uri Alon, Sumit Agarwal, Graham Neubig		Since the rise of neural models of code that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching as BLEU, CodeBERTScore computes a soft similarity score between each token in the generated code and in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score . Our language-specific models have been downloaded more than 25,000 times from the Huggingface Hub. +	evaluation Transformer
2023	Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions + + + + +	Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Jacob Ginesin, Edward Berman, George Chakhnashvili, Anton Lozhkov, Carolyn Jane Anderson, Arjun Guha		A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit. +	editing
2023	TraceFixer: Execution Trace-Driven Program Repair + + + + +	Islem Bouzenia, Yangruibo Ding, Kexin Pei, Baishakhi Ray, Michael Pradel		When debugging unintended program behavior, developers can often identify the point in the execution where the actual behavior diverges from the desired behavior. For example, a variable may get assigned a wrong value, which then negatively influences the remaining computation. Once a developer identifies such a divergence, how to fix the code so that it provides the desired behavior? This paper presents TraceFixer, a technique for predicting how to edit source code so that it does not diverge from the expected behavior anymore. The key idea is to train a neural program repair model that not only learns from source code edits but also exploits excerpts of runtime traces. The input to the model is a partial execution trace of the incorrect code, which can be obtained automatically through code instrumentation, and the correct state that the program should reach at the divergence point, which the user provides, e.g., in an interactive debugger. Our approach fundamentally differs from current program repair techniques, which share a similar goal but exploit neither execution traces nor information about the desired program state. We evaluate TraceFixer on single-line mistakes in Python code. After training the model on hundreds of thousands of code edits created by a neural model that mimics real-world bugs, we find that exploiting execution traces improves the bug-fixing ability by 13% to 20% (depending on the dataset, within the top-10 predictions) compared to a baseline that learns from source code edits only. Applying TraceFixer to 20 real-world Python bugs shows that the approach successfully fixes 10 of them. +	Transformer repair dynamic
2023	Improving Few-Shot Prompts with Relevant Static Analysis Products + + + + +	Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, Earl T. Barr		Large Language Models (LLM) are a new class of computation engines, “programmed” via prompt engineering. We are still learning how to best “program” these LLMs to help developers. We start with the intuition that developers tend to consciously and unconsciously have a collection of semantics facts in mind when working on coding tasks. Mostly these are shallow, simple facts arising from a quick read. For a function, examples of facts might include parameter and local variable names, return expressions, simple pre- and post-conditions, and basic control and data flow, etc. + + One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them inherently capable of doing this simple level of “code analysis” and extracting such information, implicitly, while processing code: but are they, really? If they aren’t, could explicitly adding this information help? Our goal here is to investigate this question, using the code summarization task and evaluate whether automatically augmenting an LLM’s prompt with semantic facts explicitly, actually helps. + + Prior work shows that LLM performance on code summarization benefits from few-shot samples drawn either from the same-project or from examples found via information retrieval methods (such as BM25). While summarization performance has steadily increased since the early days, there is still room for improvement: LLM performance on code summarization still lags its performance on natural-language tasks like translation and text summarization. + + We find that adding semantic facts actually does help! This approach improves performance in several different settings suggested by prior work, including for two different Large Language Models. In most cases, improvement nears or exceeds 2 BLEU; for the PHP language in the challenging CodeSearchNet dataset, this augmentation actually yields performance surpassing 30 BLEU. +	summarization Transformer
2023	Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context + + + + +	Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, Sriram Rajamani	NeurIPS	Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. + + Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model. + + We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen. +	autocomplete benchmark code completion code generation compilation completion dataset evaluation language model large language models program analysis static analysis tool
2023	(Partial) Program Dependence Learning + + + + +	Aashish Yadavally, Wenbo Wang, Shaohua Wang, Tien N. Nguyen	ICSE	Code fragments from developer forums often migrate to applications due to the code reuse practice. Owing to the incomplete nature of such programs, analyzing them to early determine the presence of potential vulnerabilities is challenging. In this work, we introduce NeuralPDA, a neural network-based program dependence analysis tool for both complete and partial programs. Our tool efficiently incorporates intra-statement and inter-statement contextual features into statement representations, thereby modeling program dependence analysis as a statement-pair dependence decoding task. In the empirical evaluation, we report that NeuralPDA predicts the CFG and PDG edges in complete Java and C/C++ code with combined F-scores of 94.29% and 92.46%, respectively. The F-score values for partial Java and C/C++ code range from 94.29%–97.17% and 92.46%–96.01%, respectively. We also test the usefulness of the PDGs predicted by NEURALPDA (i.e., PDG) on the downstream task of method-level vulnerability detection. We discover that the performance of the vulnerability detection tool utilizing PDG is only 1.1% less than that utilizing the PDGs generated by a program analysis tool. We also report the detection of 14 real-world vulnerable code snippets from StackOverflow by a machine learning-based vulnerability detection tool that employs the PDGs predicted by NeuralPDA for these code snippets. +	large language models program analysis static analysis tool
2023	Universal Fuzzing via Large Language Models + + + + +	Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, Lingming Zhang		Fuzzing has achieved tremendous success in discovering bugs and vulnerabilities in various software systems. Systems under test (SUTs) that take in programming or formal language as inputs, e.g., compilers, runtime engines, constraint solvers, and software libraries with accessible APIs, are especially important as they are fundamental building blocks of software development. However, existing fuzzers for such systems often target a specific language, and thus cannot be easily applied to other languages or even other versions of the same language. Moreover, the inputs generated by existing fuzzers are often limited to specific features of the input language, and thus can hardly reveal bugs related to other or new features. This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are wellsuited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 76 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 47 bugs already confirmed by developers as previously unknown. +	fuzzing
2023	TypeT5: Seq2seq Type Inference using Static Analysis + + + + +	Jiayi Wei, Greg Durrett, Isil Dillig	ICLR	There has been growing interest in automatically predicting missing type annotations in programs written in Python and JavaScript. While prior methods have achieved impressive accuracy when predicting the most common types, they often perform poorly on rare or complex types. In this paper, we present a new type inference method that treats type prediction as a code infilling task by leveraging CodeT5, a state-of-the-art seq2seq pre-trained language model for code. Our method uses static analysis to construct dynamic contexts for each code element whose type signature is to be predicted by the model. We also propose an iterative decoding scheme that incorporates previous type predictions in the model’s input context, allowing information exchange between related code elements. Our evaluation shows that the proposed approach, TypeT5, not only achieves a higher overall accuracy (particularly on rare and complex types) but also produces more coherent results with fewer type errors – while enabling easy user intervention. +	types Transformer
2022	ReACC: A Retrieval-Augmented Code Completion Framework + + + + +	Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, Alexey Svyatkovskiy		Code completion, which aims to predict the following code token(s) according to the code context, can improve the productivity of software development. Recent work has proved that statistical language modeling with transformers can greatly improve the performance in the code completion task via learning from large-scale source code datasets. However, current approaches focus only on code context within the file or project, i.e. internal context. Our distinction is utilizing “external” context, inspired by human behaviors of copying from the related code snippets when writing code. Specifically, we propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval. We adopt a stage-wise training approach that combines a source code retriever and an auto-regressive language model for programming language. We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark. +	Transformer autocomplete
2022	Open-ended Knowledge Tracing + + + + +	Naiming Liu, Zichao Wang, Richard G. Baraniuk, Andrew Lan		In education applications, knowledge tracing refers to the problem of estimating students’ time-varying concept/skill mastery level from their past responses to questions and predicting their future performance. One key limitation of most existing knowledge tracing methods is that they treat student responses to questions as binary-valued, i.e., whether they are correct or incorrect. Response correctness analysis/prediction ignores important information on student knowledge contained in the exact content of the responses, especially for open-ended questions. In this paper, we conduct the first exploration into open-ended knowledge tracing (OKT) by studying the new task of predicting students’ exact open-ended responses to questions. Our work is grounded in the domain of computer science education with programming questions. We develop an initial solution to the OKT problem, a student knowledge-guided code generation approach, that combines program synthesis methods using language models with student knowledge tracing methods. We also conduct a series of quantitative and qualitative experiments on a real-world student code dataset to validate OKT and demonstrate its promise in educational applications. +	education code generation
2022	CodeReviewer: Pre-Training for Automating Code Review Activities + + + + +	Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan		Code review is an essential part to software development lifecycle since it aims at guaranteeing the quality of codes. Modern code review activities necessitate developers viewing, understanding and even running the programs to assess logic, functionality, latency, style and other factors. It turns out that developers have to spend far too much time reviewing the code of their peers. Accordingly, it is in significant demand to automate the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a large-scale dataset of real world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review senario. To evaluate our model, we focus on three key tasks related to code review activities, including code change quality estimation, review comment generation and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms the previous state-of-the-art pre-training approaches in all tasks. Further analysis show that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model on the understanding of code changes and reviews. +	review
2022	Exploring Representation-Level Augmentation for Code Search + + + + +	Haochen Li, Chunyan Miao, Cyril Leung, Yanxian Huang, Yuan Huang, Hongyu Zhang, Yanlin Wang	EMNLP	Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning is widely used in code search research, where many data augmentation approaches for source code (e.g., semantic-preserving program transformation) are proposed to learn better representations. However, these augmentations are at the raw-data level, which requires additional code analysis in the preprocessing stage and additional training costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at representation level which does not require additional data processing and training, and based on this we propose a general format of representation-level augmentation that unifies existing methods. Then, we propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods on code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. The experimental results show that our approach can consistently boost the performance of the studied code search models. +	search Transformer
2022	Topical: Learning Repository Embeddings from Source Code using Attention + + + + +	Agathe Lherondelle, Yash Satsangi, Fran Silavong, Shaltiel Eloul, Sean Moran	Arxiv	Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationship between software artefacts, MLOnCode +augments the software developer’s capabilities with code autogeneration, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script level +representation of code is sufficient, however, in many cases a repository level representation that takes into account various dependencies and repository structure is imperative, for example, +auto-tagging repositories with topics or auto-documentation of repository code etc. Existing methods for computing repository level representations suffer from (a) reliance on natural language +documentation of code (for example, README files) (b) naive aggregation of method/script-level representation, for example, by concatenation or averaging. This paper introduces Topical a +deep neural network to generate repository level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph and the +script level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that +were crawled along with their ground truth topic tags. Our experiments show that the embeddings computed by Topical are able to outperform multiple baselines, including baselines +that naively combine the method-level representations through averaging or concatenation at the task of repository auto-tagging. Furthermore, we show that Topical’s attention mechanism outperforms naive aggregation methods when computing repositorylevel representations from script-level representation generated +by existing methods. Topical is a lightweight framework for computing repository-level representation of code repositories that scales efficiently with the number of topics and dataset size. +	representation topic modelling
2022	Human perceiving behavior modeling in evaluation of code generation models + + + + +	S. Kovalchuk, V. Lomshakov, A. Aliev	GEM	Within this study, we evaluated a series of code generation models based on CodeGen and GPTNeo to compare the metric-based performance and human evaluation. For a deeper analysis of human perceiving within the evaluation procedure we’ve implemented a 5-level Likert scale assessment of the model output using a perceiving model based on the Theory of Planned Behavior (TPB). Through such analysis, we showed an extension of model assessment as well as a deeper understanding of the quality and applicability of generated code for practical question answering. The approach was evaluated with several model settings in order to assess diversity in quality and style of answer. With the TPB-based model, we showed a different level of perceiving the model result, namely personal understanding, agreement level, and readiness to use the particular code. With such analysis, we investigate a series of issues in code generation as natural language generation (NLG) problems observed in a practical context of programming question-answering with code. +	code generation evaluation human evaluation
2022	The Stack: 3TB of permissively licensed source code + + + + +	Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries		Large Language Models (LLMs) play an ever-increasing role in the field of +Artificial Intelligence (AI)–not only for natural language processing but also +for code understanding and generation. To stimulate open and responsible +research on LLMs for code, we introduce The Stack, a 3.1 TB dataset +consisting of permissively licensed source code in 30 programming languages. +We describe how we collect the full dataset, construct a permissively licensed +subset, and present promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that +(1) near-deduplicating the data significantly boosts performance across all +experiments, and (2) it is possible to match previously reported HumanEval +and MBPP performance using only permissively licensed data. We make the +dataset available at https://hf.co/BigCode and give developers the possi- +bility to have their code removed from the dataset by following the instruc- +tions at https://www.bigcode-project.org/docs/about/the-stack/. +	dataset
2022	Learning to Reduce False Positives in Analytic Bug Detectors + + + + +	Anant Kharkar, Roshanak Zilouchian Moghaddam, Matthew Jin, Xiaoyu Liu, Xin Shi, Colin Clement, Neel Sundaresan	ICSE	Due to increasingly complex software design and rapid iterative development, code defects and security vulnerabilities are prevalent in modern software. In response, programmers rely on static analysis tools to regularly scan their codebases and find potential bugs. In order to maximize coverage, however, these tools generally tend to report a significant number of false positives, requiring developers to manually verify each warning. To address this problem, we propose a Transformer-based learning approach to identify false positive bug warnings. We demonstrate that our models can improve the precision of static analysis by 17.5%. In addition, we validated the generalizability of this approach across two major bug types: null dereference and resource leak. +	Transformer static analysis
2022	JEMMA: An Extensible Java Dataset for ML4Code Applications + + + + +	Anjan Karmakar, Miltiadis Allamanis, Romain Robbes	EMSE	Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code’s richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with. +	dataset
2022	Assemble Foundation Models for Automatic Code Summarization + + + + +	Jian Gu, Pasquale Salza, Harald C. Gall	SANER	Automatic code summarization is beneficial to software development and maintenance since it reduces the burden of manual tasks. Currently, artificial intelligence is undergoing a paradigm shift. The foundation models pretrained on massive data and finetuned to downstream tasks surpass specially customized models. This trend inspired us to consider reusing foundation models instead of learning from scratch. Based on this, we propose a flexible and robust approach for automatic code summarization based on neural networks. We assemble available foundation models, such as CodeBERT and GPT-2, into a single model named AdaMo. Moreover, we utilize Gaussian noise as the simulation of contextual information to optimize the latent representation. Furthermore, we introduce two adaptive schemes from the perspective of knowledge transfer, namely continuous pretraining and intermediate finetuning, and design intermediate stage tasks for general sequence-to-sequence learning. Finally, we evaluate AdaMo against a benchmark dataset for code summarization, by comparing it with state-of-the-art models. +	summarization documentation language model
2022	Learning To Predict User-Defined Types + + + + +	Kevin Jesse, Premkumar T. Devanbu, Anand Sawant	TSE	TypeScript is a widely adopted gradual typed language where developers can optionally type variables, functions, parameters and more. Probabilistic type inference approaches with ML (machine learning) work well especially for commonly occurring types such as boolean, number, and string. TypeScript permits a wide range of types including developer defined class names and type interfaces. These developer defined types, termed user-defined types, can be written within the realm of language naming conventions. The set of user-defined types is boundless and existing bounded type guessing approaches are an imperfect solution. Existing works either under perform in user-defined types or ignore user-defined types altogether. This work leverages a BERT-style pre-trained model, with multi-task learning objectives, to learn how to type user-defined classes and interfaces. Thus we present DIVERSETYPER, a solution that explores the diverse set of user-defined types by uniquely aligning classes and interfaces declarations to the places in which they are used. DIVERSETYPER surpasses all existing works including those that model user-defined types. +	Transformer types
2022	Semantic Robustness of Models of Source Code + + + + +	Jordan Henkel, Goutham Ramakrishnan, Zi Wang, Aws Albarghouthi, Somesh Jha, Thomas Reps	SANER	Deep neural networks are vulnerable to adversarial examples - small input perturbations that result in incorrect predictions. We study this problem for models of source code, where we want the neural network to be robust to source-code modifications that preserve code functionality. To facilitate training robust models, we define a powerful and generic adversary that can employ sequences of parametric, semantics-preserving program transformations. We then explore how, with such an adversary, one can train models that are robust to adversarial program transformations. We conduct a thorough evaluation of our approach and find several surprising facts: we find robust training to beat dataset augmentation in every evaluation we performed; we find that a state-of-the-art architecture (code2seq) for models of code is harder to make robust than a simpler baseline; additionally, we find code2seq to have surprising weaknesses not present in our simpler baseline model; finally, we find that robust models perform better against unseen data from different sources (as one might hope) - however, we also find that robust models are not clearly better in the cross-language transfer task. To the best of our knowledge, we are the first to study the interplay between robustness of models of code and the domain-adaptation and cross-language transfer tasks. +	adversarial naming
2022	On Distribution Shift in Learning-based Bug Detectors + + + + +	Jingxuan He, Luca Beurer-Kellner, Martin Vechev		Deep learning has recently achieved initial success in program analysis tasks such as bug detection. Lacking real bugs, most existing works construct training and test data by injecting synthetic bugs into correct programs. Despite achieving high test accuracy (e.g. >90%), the resulting bug detectors are found to be surprisingly unusable in practice, i.e., <10% precision when used to scan real software repositories. In this work, we argue that this massive performance difference is caused by distribution shift, i.e., a fundamental mismatch between the real bug distribution and the synthetic bug distribution used to train and evaluate the detectors. To address this key challenge, we propose to train a bug detector in two phases, first on a synthetic bug distribution to adapt the model to the bug detection domain, and then on a real bug distribution to drive the model towards the real distribution. During these two phases, we leverage a multi-task hierarchy, focal loss, and contrastive learning to further boost performance. We evaluate our approach extensively on three widely studied bug types, for which we construct new datasets carefully designed to capture the real bug distribution. The results demonstrate that our approach is practically effective and successfully mitigates the distribution shift: our learned detectors are highly performant on both our constructed test set and the latest version of open source repositories. +	defect
2022	I Speak, You Verify: Toward Trustworthy Neural Program Synthesis + + + + +	Darren Key, Wen-Ding Li, Kevin Ellis		We develop an approach for improving the trustworthiness and overall accuracy of program synthesizers based on large language models for source code. Given a natural language description of a programming problem, our method samples both candidate programs as well as candidate predicates specifying how the program should behave. We learn to analyze the agreement between programs and predicates to judge both which program is most likely to be correct, and also judge whether the language model is able to solve the programming problem in the first place. This latter capacity allows favoring high precision over broad recall: fostering trust by only proposing a program when the system is certain that it is correct. +	synthesis
2022	Semantic Similarity Metrics for Evaluating Source Code Summarization + + + + +	Sakib Haque, Zachary Eberhart, Aakash Bansal, Collin McMillan		Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have to programmers and the simultaneously high cost of writing and maintaining documentation by hand. Current work is almost all based on machine models trained via big data input. Large datasets of examples of code and summaries of that code are used to train an e.g. encoder-decoder neural model. Then the output predictions of the model are evaluated against a set of reference summaries. The input is code not seen by the model, and the prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with using word overlap is that not all words in a sentence have the same importance, and many words have synonyms. The result is that calculated similarity may not match the perceived similarity by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate to human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for evaluation of source code summarization. +	human evaluation evaluation
2022	Productivity Assessment of Neural Code Completion + + + + +	Albert Ziegler, Eirini Kalliamvakou, Shawn Simister, Ganesh Sittampalam, Alice Li, Andrew Rice, Devon Rifkin, Edward Aftandilian	MAPS	Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers’ productivity, without being able to measure it directly. In this case study, we asked users of GitHub Copilot about its impact on their productivity, and sought to find a reflection of their perception in directly measurable user data. We find that the rate with which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers’ perception of productivity. +	evaluation human evaluation
2022	UniXcoder: Unified Cross-Modal Pre-training for Code Representation + + + + +	Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin		Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion that requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming language. The model utilizes mask attention matrices with prefix adapters to control the behavior of the model and leverages cross-modal contents like AST and code comment to enhance code representation. To encode AST that is represented as a tree in parallel, we propose a one-to-one mapping method to transform AST in a sequence structure that retains all structural information from the tree. Furthermore, we propose to utilize multi-modal contents to learn representation of code fragment with contrastive learning, and then align representations among programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the performance of code fragment representation, we also construct a dataset for a new task, called zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks and analysis reveals that comment and AST can both enhance UniXcoder. +	Transformer
2022	Cross-Language Binary-Source Code Matching with Intermediate Representations + + + + +	Yi Gui, Yao Wan, Hongyu Zhang, Huifang Huang, Yulei Sui, Guandong Xu, Zhiyuan Shao, Hai Jin	SANER	Binary-source code matching plays an important role in many security and software engineering related tasks such as malware detection, reverse engineering and vulnerability assessment. Currently, several approaches have been proposed for binary-source code matching by jointly learning the embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches target on matching the binary code and source code written in a single programming language. However, in practice, software applications are often written in different programming languages to cater for different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching, and develops a new dataset for this new problem. We present a novel approach XLIR, which is a Transformer-based neural network by learning the intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks of cross-language binary-source code matching, and cross-language source-source code matching, on top of our curated dataset. Experimental results and analysis show that our proposed XLIR with intermediate representations significantly outperforms other state-of-the-art models in both of the two tasks. +	code similarity clone
2022	Learning to Complete Code with Sketches + + + + +	Daya Guo, Alexey Svyatkovskiy, Jian Yin, Nan Duan, Marc Brockschmidt, Miltiadis Allamanis	ICLR	Code completion is usually cast as a language modelling problem, i.e., continuing an input in a left-to-right fashion. However, in practice, some parts of the completion (e.g., string literals) may be very hard to predict, whereas subsequent parts directly follow from the context. To handle this, we instead consider the scenario of generating code completions with “holes” inserted in places where a model is uncertain. We develop Grammformer, a Transformer-based model that guides code generation by the programming language grammar, and compare it to a variety of more standard sequence models. + + We train the models on code completion for C# and Python given partial code context. To evaluate models, we consider both ROUGE as well as a new metric RegexAcc that measures success of generating completions matching long outputs with as few holes as possible. In our experiments, Grammformer generates 10-50% more accurate completions compared to traditional generative models and 37-50% longer sketches compared to sketch-generating baselines trained with similar techniques. +	Transformer language model grammar
2022	DeepPERF: A Deep Learning-Based Approach For Improving Software Performance + + + + +	Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B. Clement, Neel Sundaresan, Chen Wu		Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the wide-spread availability of open source data creates a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepPERF, a transformer-based approach to suggest performance improvements for C# applications. We pretrain DeepPERF on English and Source code corpora and followed by finetuning for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in ~53% of the cases, getting ~34% of them verbatim in our expert-verified dataset of performance changes made by C# developers. Additionally, we evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that our model is able to suggest valid performance improvements that can improve both CPU usage and Memory allocations. So far we’ve submitted 19 pull-requests with 28 different performance optimizations and 11 of these PRs have been approved by the project owners. +	Transformer optimization
2022	InCoder: A Generative Model for Code Infilling and Synthesis + + + + +	Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, Mike Lewis		Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling). InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first generative model that is able to directly perform zero-shot code infilling, which we evaluate on challenging tasks such as type inference, comment generation, and variable re-naming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale. The InCoder models and code are publicly released at https://sites.google.com/view/incoder-code-models +	Transformer code generation naming summarization
2022	Piloting Copilot and Codex: Hot Temperature, Cold Prompts, or Black Magic? + + + + +	Jean-Baptiste Döderlein, Mathieu Acher, Djamel Eddine Khelladi, Benoit Combemale		Language models are promising solutions for tackling increasing complex problems. In software engineering, they recently attracted attention in code assistants, with programs automatically written in a given programming language from a programming task description in natural language. They have the potential to save time and effort when writing code. However, these systems are currently poorly understood, preventing them from being used optimally. In this paper, we investigate the various input parameters of two language models, and conduct a study to understand if variations of these input parameters (e.g. programming task description and the surrounding context, creativity of the language model, number of generated solutions) can have a significant impact on the quality of the generated programs. We design specific operators for varying input parameters and apply them over two code assistants (Copilot and Codex) and two benchmarks representing algorithmic problems (HumanEval and LeetCode). Our results showed that varying the input parameters can significantly improve the performance of language models. However, there is a tight dependency when varying the temperature, the prompt and the number of generated solutions, making potentially hard for developers to properly control the parameters to obtain an optimal result. This work opens opportunities to propose (automated) strategies for improving performance. +	Transformer
2022	CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code + + + + +	Aryaz Eghbali, Michael Pradel	ASE	Recent years have brought a surge of work on predicting pieces +of source code, e.g., for code completion, code migration, program +repair, or translating natural language into code. All this work faces +the challenge of evaluating the quality of a prediction w.r.t. some +oracle, typically in the form of a reference solution. A common +evaluation metric is the BLEU score, an n-gram-based metric originally proposed for evaluating natural language translation, but +adopted in software engineering because it can be easily computed +on any programming language and enables automated evaluation at +scale. However, a key difference between natural and programming +languages is that in the latter, completely unrelated pieces of code +may have many common n-grams simply because of the syntactic +verbosity and coding conventions of programming languages. We +observe that these trivially shared n-grams hamper the ability of +the metric to distinguish between truly similar code examples and +code examples that are merely written in the same language. This +paper presents CrystalBLEU, an evaluation metric based on BLEU, +that allows for precisely and efficiently measuring the similarity of +code. Our metric preserves the desirable properties of BLEU, such +as being language-agnostic, able to handle incomplete or partially +incorrect code, and efficient, while reducing the noise caused by +trivially shared n-grams. We evaluate CrystalBLEU on two datasets +from prior work and on a new, labeled dataset of semantically equivalent programs. Our results show that CrystalBLEU can distinguish +similar from dissimilar code examples 1.9–4.5 times more effectively, when compared to the original BLEU score and a previously +proposed variant of BLEU for code. +	evaluation
2022	Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding + + + + +	Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Yun Xiong, Wei Dong, Xiangke Liao	ICSE	With the great success of pre-trained models, the pretrain-then-finetune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to costly training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach to bridge pre-trained models and code-related tasks. We exploit semantic-preserving transformation to enrich downstream data diversity, and help pre-trained models learn semantic features that are invariant to these semantically equivalent transformations. Further, we introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models. + + We apply our approach to a range of pre-trained models, and they significantly outperform the state-of-the-art models on tasks for source code understanding, such as algorithm classification, code clone detection, and code search. Our experiments even show that without heavy pre-training on code data, natural language pre-trained model RoBERTa fine-tuned with our lightweight approach could outperform or rival existing code pre-trained models fine-tuned on the above tasks, such as CodeBERT and GraphCodeBERT. This finding suggests that there is still much room for improvement in code pre-trained models. +	representation language model
2022	TOGA: A Neural Method for Test Oracle Generation + + + + +	Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, Shuvendu K. Lahiri	ICSE	Testing is widely recognized as an important stage of the software +development lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation. +In terms of documentation, unit tests express a unit’s intended +functionality, as conceived by the developer. A test oracle, typically expressed as an condition, documents the intended behavior +of a unit under a given test prefix. Synthesizing a functional test +oracle is a challenging problem, as it must capture the intended +functionality rather than the implemented functionality. +In this paper, we propose TOGA (a neural method for Test Oracle +GenerAtion), a unified transformer-based neural approach to infer +both exceptional and assertion test oracles based on the context of +the focal method. Our approach can handle units with ambiguous +or missing documentation, and even units with a missing implementation. We evaluate our approach on both oracle inference accuracy +and functional bug-finding. Our technique improves accuracy by +33% over existing oracle inference approaches, achieving 96% overall accuracy on a held out test dataset. Furthermore, we show that +when integrated with a automated test generation tool (EvoSuite), +our approach finds 57 real world bugs in large-scale Java programs, +including 30 bugs that are not found by any other automated testing +method in our evaluation +	code generation Transformer test generation
2022	A Systematic Evaluation of Large Language Models of Code + + + + +	Frank F. Xu, Uri Alon, Graham Neubig, Vincent J. Hellendoorn		Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, although targeted mainly for natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available at this https URL, which enables future research and application in this area. +	Transformer language model
2022	CoditT5: Pretraining for Source Code and Natural Language Editing + + + + +	Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, Milos Gligoric		Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming pure generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a pure generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks. +	Transformer edit
2022	SelfAPR: Self-supervised Program Repair with Test Execution Diagnostics + + + + +	He Ye, Matias Martinez, Xiapu Luo, Tao Zhang, Martin Monperrus		Neural program repair has achieved good results in a recent series of papers. Yet, we observe that the related work fails to repair some bugs because of a lack of knowledge about 1) the program being repaired, and 2) the actual fault being repaired. In this paper, we solve both problems by changing the learning paradigm from supervised training to self-supervised training in an approach called SelfAPR. First, SelfAPR generates and constructs training samples by perturbing a previous version of the program being repaired, enforcing the neural model to capture project-specific knowledge. This is different from all the existing work based on past commits. Second, SelfAPR extracts and encodes test execution diagnostics into the input representation, steering the neural model to fix the specific kind of fault. This is different from the existing studies that only consider static source code in the input. We implement SelfAPR and evaluate it in a systematic manner. We train SelfAPR with 253 411 training samples obtained by perturbing 17 open-source projects. We evaluate SelfAPR on 818 bugs from Defects4J, SelfAPR correctly repairs 112 of them. +	repair execution
2022	Natural Language to Code Generation in Interactive Data Science Notebooks + + + + +	Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, Charles Sutton		Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions. +	notebook evaluation
2022	CodeT: Code Generation with Generated Tests + + + + +	Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen		Given a programming problem, pre-trained language models such as Codex have demonstrated the ability to generate multiple different code solutions via sampling. However, selecting a correct or best solution from those samples still remains a challenge. While an easy way to verify the correctness of a code solution is through executing test cases, producing high-quality test cases is prohibitively expensive. In this paper, we explore the use of pre-trained language models to automatically generate test cases, calling our method CodeT: Code generation with generated Tests. CodeT executes the code solutions using the generated test cases, and then chooses the best solution based on a dual execution agreement with both the generated test cases and other generated solutions. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks. Extensive experimental results demonstrate CodeT can achieve significant, consistent, and surprising improvements over previous methods. For example, CodeT improves the pass@1 on HumanEval to 65.8%, an increase of absolute 18.8% on the code-davinci-002 model, and an absolute 20+% improvement over previous state-of-the-art results. +	synthesis Transformer execution
2022	Learning to Reverse DNNs from AI Programs Automatically + + + + +	Simin Chen, Hamed Khanpour, Cong Liu, Wei Yang	IJCAI-ECAI 2022	With the privatization deployment of DNNs on edge devices, the security of on-device DNNs has raised significant concern. To quantify the model leakage risk of on-device DNNs automatically, we propose NNReverse, the first learning-based method which can reverse DNNs from AI programs without domain knowledge. NNReverse trains a representation model to represent the semantics of binary code for DNN layers. By searching the most similar function in our database, NNReverse infers the layer type of a given function’s binary code. To represent assembly instructions semantics precisely, NNReverse proposes a more finegrained embedding model to represent the textual and structural-semantic of assembly functions. +	Reverse Engineering Binary Code
2022	Exploring and Evaluating Personalized Models for Code Generation + + + + +	Andrei Zlotchevski, Dawn Drain, Alexey Svyatkovskiy, Colin Clement, Neel Sundaresan, Michele Tufano	FSE	Large Transformer models achieved the state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain – for example, question-answering on a given topic – generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model’s parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios. +	Transformer
2022	An Extensive Study on Pre-trained Models for Program Understanding and Generation + + + + +	Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, Lingming Zhang	ISSTA	Automatic program understanding and generation techniques could +significantly advance the productivity of programmers and have +been widely studied by academia and industry. Recently, the advent of pre-trained paradigm enlightens researchers to develop +general-purpose pre-trained models which can be applied for a +broad range of program understanding and generation tasks. Such +pre-trained models, derived by self-supervised objectives on large +unlabelled corpora, can be fine-tuned in downstream tasks (such +as code search and code generation) with minimal adaptations. Although these pre-trained models claim superiority over the prior +techniques, they seldom follow equivalent evaluation protocols, e.g., +they are hardly evaluated on the identical benchmarks, tasks, or settings. Consequently, there is a pressing need for a comprehensive +study of the pre-trained models on their effectiveness, versatility +as well as the limitations to provide implications and guidance for +the future development in this area. To this end, we first perform +an extensive study of eight open-access pre-trained models over +a large benchmark on seven representative code tasks to assess +their reproducibility. We further compare the pre-trained models +and domain-specific state-of-the-art techniques for validating pre-trained effectiveness. At last, we investigate the robustness of the +pre-trained models by inspecting their performance variations under adversarial attacks. Through the study, we find that while we +can in general replicate the original performance of the pre-train +models on their evaluated tasks and adopted benchmarks, subtle +performance fluctuations can refute the findings in their original +papers. Moreover, none of the existing pre-trained models can dominate over all other models. We also find that the pre-trained models +can significantly outperform non-pre-trained state-of-the-art techniques in program understanding tasks. Furthermore, we perform +the first study for natural language-programming language pre-trained model robustness via adversarial attacks and find that a +simple random attack approach can easily fool the state-of-the-art +pre-trained models and thus incur security issues. At last, we also +provide multiple practical guidelines for advancing future research +on pre-trained models for program understanding and generation. +	Transformer evaluation
2022	Learning to Answer Semantic Queries over Code + + + + +	Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj, Aditya Kanade, Petros Maniatis, Shirish Shevade		During software development, developers need answers to queries about semantic aspects of code. Even though extractive question-answering using neural approaches has been studied widely in natural languages, the problem of answering semantic queries over code using neural networks has not yet been explored. This is mainly because there is no existing dataset with extractive question and answer pairs over code involving complex concepts and long chains of reasoning. We bridge this gap by building a new, curated dataset called CodeQueries, and proposing a neural question-answering methodology over code. +We build upon state-of-the-art pre-trained models of code to predict answer and supporting-fact spans. Given a query and code, only some of the code may be relevant to answer the query. We first experiment under an ideal setting where only the relevant code is given to the model and show that our models do well. We then experiment under three pragmatic considerations: (1) scaling to large-size code, (2) learning from a limited number of examples and (3) robustness to minor syntax errors in code. Our results show that while a neural model can be resilient to minor syntax errors in code, increasing size of code, presence of code that is not relevant to the query, and reduced number of training examples limit the model performance. We are releasing our data and models to facilitate future work on the proposed problem of answering semantic queries over code. +	static analysis Transformer
2022	What is it like to program with artificial intelligence? + + + + +	Advait Sarkar, Andrew D. Gordon, Carina Negreanu, Christian Poelitz, Sruti Srinivasa Ragavan, Ben Zorn		Large language models, such as OpenAI’s codex and Deepmind’s AlphaCode, can generate code to solve a variety of problems expressed in natural language. This technology has already been commercialised in at least one widely-used programming editor extension: GitHub Copilot. + + In this paper, we explore how programming with large language models (LLM-assisted programming) is similar to, and differs from, prior conceptualisations of programmer assistance. We draw upon publicly available experience reports of LLM-assisted programming, as well as prior usability and design studies. We find that while LLM-assisted programming shares some properties of compilation, pair programming, and programming via search and reuse, there are fundamental differences both in the technical possibilities as well as the practical experience. Thus, LLM-assisted programming ought to be viewed as a new way of programming with its own distinct properties and challenges. + + Finally, we draw upon observations from a user study in which non-expert end user programmers use LLM-assisted tools for solving data tasks in spreadsheets. We discuss the issues that might arise, and open research challenges, in applying large language models to end-user programming, particularly with users who have little or no programming expertise. +	human evaluation review
2022	Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes + + + + +	Cedric Richter, Heike Wehrheim		Real bug fixes found in open source repositories seem to be the perfect source for learning to localize and repair real bugs. However, the absence of large scale bug fix collections has made it difficult to effectively exploit real bug fixes in the training of larger neural models in the past. In contrast, artificial bugs – produced by mutating existing source code – can be easily obtained at a sufficient scale and are therefore often preferred in the training of existing approaches. Still, localization and repair models that are trained on artificial bugs usually underperform when faced with real bugs. This raises the question whether bug localization and repair models trained on real bug fixes are more effective in localizing and repairing real bugs. + + We address this question by introducing RealiT, a pre-train-and-fine-tune approach for effectively learning to localize and repair real bugs from real bug fixes. RealiT is first pre-trained on a large number of artificial bugs produced by traditional mutation operators and then fine-tuned on a smaller set of real bug fixes. Fine-tuning does not require any modifications of the learning algorithm and hence can be easily adopted in various training scenarios for bug localization or repair (even when real training data is scarce). In addition, we found that training on real bug fixes with RealiT is empirically powerful by nearly doubling the localization performance of an existing model on real bugs while maintaining or even improving the repair performance. +	Transformer repair defect
2022	Backdoors in Neural Models of Source Code + + + + +	Goutham Ramakrishnan, Aws Albarghouthi	ICPR	Deep neural networks are vulnerable to a range of adversaries. A particularly pernicious class of vulnerabilities are backdoors, where model predictions diverge in the presence of subtle triggers in inputs. An attacker can implant a backdoor by poisoning the training data to yield a desired target prediction on triggered inputs. We study backdoors in the context of deep-learning for source code. (1) We define a range of backdoor classes for source-code tasks and show how to poison a dataset to install such backdoors. (2) We adapt and improve recent algorithms from robust statistics for our setting, showing that backdoors leave a spectral signature in the learned representation of source code, thus enabling detection of poisoned data. (3) We conduct a thorough evaluation on different architectures and languages, showing the ease of injecting backdoors and our ability to eliminate them. +	adversarial
2022	Learning to Model Editing Processes + + + + +	Machel Reid, Graham Neubig		Most existing sequence generation models produce outputs in one pass, usually left-to-right. However, this is in contrast with a more natural approach that humans use in generating content; iterative refinement and editing. Recent work has introduced edit-based models for various tasks (such as neural machine translation and text style transfer), but these generally model a single edit step. In this work, we propose modeling editing processes, modeling the whole process of iteratively generating sequences. We form a conceptual framework to describe the likelihood of multi-step edits, and describe neural models that can learn a generative model of sequences based on these multistep edits. We introduce baseline results and metrics on this task, finding that modeling editing processes improves performance on a variety of axes on both our proposed task and related downstream tasks compared to previous single-step models of edits. +	Transformer edit
2022	Memorization and Generalization in Neural Code Intelligence Models + + + + +	Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour, Vincent J. Hellendoorn	IST	Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight important risks: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed all models manifest some forms of memorization. This can be potentially troublesome in most code intelligence tasks where they rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models in code intelligence systems that are usually overlooked by software engineering researchers. +	evaluation memorization generalizability refactoring language model
2022	Synchromesh: Reliable code generation from pre-trained language models + + + + +	Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, Sumit Gulwani	ICLR	Large pre-trained language models have been used to generate code,providing a flexible interface for synthesizing programs from natural language specifications. However, they often violate syntactic and semantic rules of their output language, limiting their practical usability. In this paper, we propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation. Synchromesh comprises two components. First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection. TST learns to recognize utterances that describe similar target programs despite differences in surface natural language features. Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD): a general framework for constraining the output to a set of valid programs in the target language. CSD leverages constraints on partial outputs to sample complete correct programs, and needs neither re-training nor fine-tuning of the language model. We evaluate our methods by synthesizing code from natural language descriptions using GPT-3 and Codex in three real-world languages: SQL queries, Vega-Lite visualizations and SMCalFlow programs. These domains showcase rich constraints that CSD is able to enforce, including syntax, scope, typing rules, and contextual logic. We observe substantial complementary gains from CSD and TST in prediction accuracy and in effectively preventing run-time errors. +	Transformer language model
2022	Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models + + + + +	Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour	MAPS	Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI models. However, this approach is syntax-unaware and does not consider the grammar of the programming language. In this paper, we apply a syntax-guided program reduction technique that considers the grammar of the input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique is faster and provides smaller sets of key tokens in reduced programs. We also show that the key tokens could be used in generating adversarial examples for up to 65% of the input programs. +	interpretability refactoring adversarial
2022	Exploring Dimensions of Generalizability and Few-shot Transfer for Text-to-SQL Semantic Parsing + + + + +	Rajaswa Patil, Manasi Patwardhan, Shirish Karande, Lovekesh Vig, Gautam Shroff	The 1st Transfer Learning for Natural Language Processing Workshop (TL4NLP 2022)	Existing work on generalization in Text-to-SQL semantic parsing has been restricted to a zero-shot cross-domain setting. In this paper, we introduce Spider-Gen: a Text-to-SQL benchmark to develop a paradigm of transfer learning across distinct dimensions of generalization in Text-to-SQL semantic parsing. The Spider-Gen benchmark focuses on few-shot adaption for Cross-domain, Lexical, and Structural generalization of Text-to-SQL models. Through our experiments with the Spider-Gen dataset, we show that Seq2Seq language models struggle to generalize against change in data distribution, lexical changes in database schema, and changes in SQL query complexity. Our experiments also reveal that performing few-shot fine-tuning helps Text-to-SQL models to generalize across these changes. However, such few-shot adaptation comes with a negative effect on the knowledge learnt during training. Hence, we also explore Parameter-efficient Fine-tuning methods to overcome the limitations of Seq2Seq Text-to-SQL models. We release the Spider-Gen dataset publicly to facilitate further research in generalization and transfer learning across various dimensions in Text-to-SQL semantic parsing. +	dataset evaluation Transformer benchmark generalizability
2022	CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation + + + + +	Pardis Pashakhanloo, Aaditya Naik, Yuepeng Wang, Hanjun Dai, Petros Maniatis, Mayur Naik	ICLR	Designing a suitable representation for code-reasoning tasks is challenging in aspects such as the kinds of program information to model, how to combine them, and how much context to consider. We propose CodeTrek, a deep learning approach that addresses these challenges by representing codebases as databases that conform to rich relational schemas. The relational representation not only allows CodeTrek to uniformly represent diverse kinds of program information, but also to leverage program-analysis queries to derive new semantic relations, which can be readily incorporated without further architectural engineering. CodeTrek embeds this relational representation using a set of walks that can traverse different relations in an unconstrained fashion, and incorporates all relevant attributes along the way. We evaluate CodeTrek on four diverse and challenging Python tasks: variable misuse, exception prediction, unused definition, and variable shadowing. +CodeTrek achieves an accuracy of 91%, 63%, 98%, and 94% on these tasks respectively, and outperforms state-of-the-art neural models by 2-19% points. +	representation variable misuse
2022	Using Developer Discussions to Guide Fixing Bugs in Software + + + + +	Sheena Panthaplackel, Milos Gligoric, Junyi Jessy Li, Raymond J. Mooney	EMNLP	Automatically fixing software bugs is a challenging task. While recent work showed that natural language context is useful in guiding bug-fixing models, the approach required prompting developers to provide this context, which was simulated through commit messages written after the bug-fixing code changes were made. We instead propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for any additional information from developers. For this, we augment standard bug-fixing datasets with bug report discussions. Using these newly compiled datasets, we demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits. +	Transformer repair
2022	Making the Most of Scarce Input Data in Deep Learning-Based Source Code Classification for Heterogeneous Device Mapping + + + + +	Emanuele Parisi, Francesco Barchi, Andrea Bartolini, Andrea Acquaviva		Despite its relatively recent history, deep learning (DL)-based source code analysis is already a cornerstone in machine learning for compiler optimization. When applied to the classification of pieces of code to identify the best computational unit in a heterogeneous Systems-on-Chip, it can be effective in supporting decisions that a programmer has otherwise to take manually. Several techniques have been proposed exploiting different networks and input information, prominently sequence-based and graph-based representations, complemented by auxiliary information typically related to payload and device configuration. While the accuracy of DL methods strongly depends on the training and test datasets, so far no exhaustive and statistically meaningful analysis has been done on its impact on the results and on how to effectively extract the available information. This is relevant also considering the scarce availability of source code datasets that can be labeled by profiling on heterogeneous compute units. In this article, we first present such a study, which leads us to devise the contribution of code sequences and auxiliary inputs separately. Starting from this analysis, we then demonstrate that by using the normalization of auxiliary information, it is possible to improve state-of-the-art results in terms of accuracy. Finally, we propose a novel approach exploiting Siamese networks that further improve mapping accuracy by increasing the cardinality of the dataset, thus compensating for its relatively small size. +	optimization program analysis static analysis language model
2022	A Conversational Paradigm for Program Synthesis + + + + +	Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong		Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI’s Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: https://github.com/salesforce/CodeGen. +	Transformer synthesis
2022	SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations + + + + +	Changan Niu, Chuanyi Li, Vincent Ng, Jidong Ge, Liguo Huang, Bin Luo	ICSE	Recent years have seen the successful application of large pre-trained modelsto code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding theirapplication to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existingpre-training tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstreamt asks. Experimental results demonstrate that SPT-Code achieves state-of-the-artperformance on five code-related downstream tasks after fine-tuning. +	Transformer representation
2022	Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis + + + + +	Shounak Naik, Rajaswa Patil, Swati Agarwal, Veeky Baths	International Conference on Advanced Data Mining and Applications (ADMA 2022)	Representational Similarity Analysis is a method from cognitive neuroscience, which helps in comparing representations from two different sources of data. In this paper, we propose using Representational Similarity Analysis to probe the semantic grounding in language models of code. We probe representations from the CodeBERT model for semantic grounding by using the data from the IBM CodeNet dataset. Through our experiments, we show that current pre-training methods do not induce semantic grounding in language models of code, and instead focus on optimizing form-based patterns. We also show that even a little amount of fine-tuning on semantically relevant tasks increases the semantic grounding in CodeBERT significantly. Our ablations with the input modality to the CodeBERT model show that using bimodal inputs (code and natural language) over unimodal inputs (only code) gives better semantic grounding and sample efficiency during semantic fine-tuning. Finally, our experiments with semantic perturbations in code reveal that CodeBERT is able to robustly distinguish between semantically correct and incorrect code. +	interpretability language model evaluation Transformer
2022	CodeDSI: Differentiable Code Search + + + + +	Usama Nadeem, Noah Ziems, Shaoen Wu		Reimplementing solutions to previously solved software engineering problems is not only inefficient but also introduces inadequate and error-prone code. Many existing methods achieve impressive performance on this issue by using autoregressive text-generation models trained on code. However, these methods are not without their flaws. The generated code from these models can be buggy, lack documentation, and introduce vulnerabilities that may go unnoticed by developers. An alternative to code generation – neural code search – is a field of machine learning where a model takes natural language queries as input and, in turn, relevant code samples from a database are returned. Due to the nature of this pre-existing database, code samples can be documented, tested, licensed, and checked for vulnerabilities before being used by developers in production. In this work, we present CodeDSI, an end-to-end unified approach to code search. CodeDSI is trained to directly map natural language queries to their respective code samples, which can be retrieved later. In an effort to improve the performance of code search, we have investigated docid representation strategies, impact of tokenization on docid structure, and dataset sizes on overall code search performance. Our results demonstrate CodeDSI strong performance, exceeding conventional robust baselines by 2-6% across varying dataset sizes. +	search
2022	Code Translation with Compiler Representations + + + + +	Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve		In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low quality translation, reducing the practicality of NMT, and stressing the need for an approach significantly increasing its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java - Rust pair. We extend previous test sets for code translation, by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as intermediary pivot for translation. +	Transformer migration decompilation
2022	Learning Program Semantics with Code Representations: An Empirical Study + + + + +	Jing Kai Siow, Shangqing Liu, Xiaofei Xie, Guozhu Meng, Yang Liu	SANER	Program semantics learning is the core and fundamental for various code intelligent tasks e.g., vulnerability detection, clone detection. A considerable amount of existing works propose diverse approaches to learn the program semantics for different tasks and these works have achieved state-of-the-art performance. However, currently, a comprehensive and systematic study on evaluating different program representation techniques across diverse tasks is still missed. + + From this starting point, in this paper, we conduct an empirical study to evaluate different program representation techniques. Specifically, we categorize current mainstream code representation techniques into four categories i.e., Feature-based, Sequence-based, Tree-based, and Graph-based program representation technique and evaluate its performance on three diverse and popular code intelligent tasks i.e., {Code Classification}, Vulnerability Detection, and Clone Detection on the public released benchmark. We further design three {research questions (RQs)} and conduct a comprehensive analysis to investigate the performance. By the extensive experimental results, we conclude that (1) The graph-based representation is superior to the other selected techniques across these tasks. (2) Compared with the node type information used in tree-based and graph-based representations, the node textual information is more critical to learning the program semantics. (3) Different tasks require the task-specific semantics to achieve their highest performance, however combining various program semantics from different dimensions such as control dependency, data dependency can still produce promising results. +	representation
2022	Repository-Level Prompt Generation for Large Language Models of Code + + + + +	Disha Shrivastava, Hugo Larochelle, Daniel Tarlow		With the success of large language models (LLMs) of code and their use as code assistants (e.g. Codex used in GitHub Copilot), techniques for introducing domain-specific knowledge in the prompt design process become important. In this work, we propose a framework called Repo-Level Prompt Generator that learns to generate example-specific prompts using a set of rules. These rules take context from the entire repository, thereby incorporating both the structure of the repository and the context from other relevant files (e.g. imports, parent class files). Our technique doesn’t require any access to the weights of the LLM, making it applicable in cases where we only have black-box access to the LLM. We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives. We demonstrate that an oracle constructed from our proposed rules gives up to 36% relative improvement over Codex, showing the quality of the rules. Further, we show that when we train a model to select the best rule, we can achieve significant performance gains over Codex. The code for our work can be found at: https://github.com/shrivastavadisha/repo_level_prompt_generation . +	Transformer code completion
2022	Senatus - A Fast and Accurate Code-to-Code Recommendation Engine + + + + +	Fran Silavong, Sean Moran, Antonios Georgiadis, Rohan Saphal, Robert Otter	MSR	Machine learning on source code (MLOnCode) is a popular research field that has been driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse and concise code snippets that usefully extend the code currently being written by a developer in their development environment (IDE). Code-to-code recommendation engines hold the promise of increasing developer productivity by reducing context switching from the IDE and increasing code-reuse. Existing code-to-code recommendation engines do not scale gracefully to large codebases, exhibiting a linear growth in query time as the code repository increases in size. In addition, existing code-to-code recommendation engines fail to account for the global statistics of code repositories in the ranking function, such as the distribution of code snippet lengths, leading to sub-optimal retrieval results. We address both of these weaknesses with Senatus, a new code-to-code recommendation engine. At the core of Senatus is De-Skew LSH a new locality sensitive hashing (LSH) algorithm that indexes the data for fast (sub-linear time) retrieval while also counteracting the skewness in the snippet length distribution using novel abstract syntax tree-based feature scoring and selection algorithms. We evaluate Senatus and find the recommendations to be of higher quality than competing baselines, while achieving faster search. For example on the CodeSearchNet dataset Senatus improves performance by 31.21% F1 and 147.9x faster query time compared to Facebook Aroma. Senatus also outperforms standard MinHash LSH by 29.2% F1 and 51.02x faster query time. +	code similarity search
2022	CV4Code: Sourcecode Understanding via Visual Code Representations + + + + +	Ruibo Shi, Lili Tao, Rohan Saphal, Fran Silavong, Sean J. Moran		We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that facilitates fast generation of sourcecode images and eliminates redundancy in the encoding that would arise from an RGB pixel representation. Furthermore, as sourcecode is treated as images, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to any particular programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code which is not possible from methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by learning Convolutional and Transformer networks to predict the functional task, i.e. the problem it solves, of the source code directly from its two-dimensional representation, and using an embedding from its latent space to derive a similarity score of two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time we show the benefits of treating sourcecode understanding as a form of image processing task. +	code similarity Transformer
2022	An Exploratory Study on Code Attention in BERT + + + + +	Rishab Sharma, Fuxiang Chen, Fatemeh H. Fard, David Lo	ICPC	Many recent models in software engineering introduced deep neural models based on the Transformer architecture or use transformer-based Pre-trained Language Models (PLM) trained on code. Although these models achieve the state of the arts results in many downstream tasks such as code summarization and bug detection, they are based on Transformer and PLM, which are mainly studied in the Natural Language Processing (NLP) field. The current studies rely on the reasoning and practices from NLP for these models in code, despite the differences between natural languages and programming languages. There is also limited literature on explaining how code is modeled. Here, we investigate the attention behavior of PLM on code and compare it with natural language. We pre-trained BERT, a Transformer based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We run several experiments to analyze the attention values of code constructs on each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the most attended token [CLS] in NLP. This observation motivated us to leverage identifiers to represent the code sequence instead of the [CLS] token when used for code clone detection. Our results show that employing embeddings from identifiers increases the performance of BERT by 605% and 4% F1-score in its lower layers and the upper layers, respectively. When identifiers’ embeddings are used in CodeBERT, a code-based PLM, the performance is improved by 21–24% in the F1-score of clone detection. The findings can benefit the research community by using code-specific representations instead of applying the common embeddings used in NLP, and open new directions for developing smaller models with similar performance. + +	Transformer representation language model interpretability pretraining clone
2022	What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code + + + + +	Yao Wan, Wei Zhao, Hongyu Zhang, Yulei Sui, Guandong Xu, Hai Jin	ICSE	Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and Transformer and have achieved promising results. However, currently there is still little progress regarding interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT, and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-training language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code have the ability of inducing syntax trees of code. Theses findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations. +	Transformer pretraining program analysis
2022	Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models + + + + +	Priyan Vaithilingam, Tianyi Zhang, Elena Glassman	CHI	Recent advances in Large Language Models (LLM) have made automatic code generation possible for real-world programming tasks in +general-purpose programming languages such as Python. However, +there are few human studies on the usability of these tools and how +they fit the programming workflow. In this work, we conducted +a within-subjects user study with 24 participants to understand +how programmers use and perceive Copilot, a LLM-based code +generation tool. We found that, while Copilot did not necessarily +improve the task completion time or success rate, most participants preferred to use Copilot in daily programming tasks, since +Copilot often provided a useful starting point and saved the effort +of searching online. However, participants did face difficulties in +understanding, editing, and debugging code snippets generated +by Copilot, which significantly hindered their task-solving effectiveness. Finally, we highlighted several promising directions for +improving the design of Copilot based on our observations and +participants’ feedback. +	human evaluation code generation language model
2022	LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition + + + + +	Rishab Sharma, Fuxiang Chen, Fatemeh H. Fard	ICPC	Code comment generation is the task of generating a high-level natural language description for a given code method/function. Although researchers have been studying multiple ways to generate code comments automatically, previous work mainly considers representing a code token in its entirety semantics form only (e.g., a language model is used to learn the semantics of a code token), and additional code properties such as the tree structure of a code are included as an auxiliary input to the model. There are two limitations: 1) Learning the code token in its entirety form may not be able to capture information succinctly in source code, and 2)The code token does not contain additional syntactic information, inherently important in programming languages. In this paper, we present LAnguage Model and Named Entity Recognition (LAMNER), a code comment generator capable of encoding code constructs effectively and capturing the structural property of a code token. A character-level language model is used to learn the semantic representation to encode a code token. For the structural property of a token, a Named Entity Recognition model is trained to learn the different types of code tokens. These representations are then fed into an encoder-decoder architecture to generate code comments. We evaluate the generated comments from LAMNER and other baselines on a popular Java dataset with four commonly used metrics. Our results show that LAMNER is effective and improves over the best baseline model in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr by 14.34%, 18.98%, 21.55%, 23.00%, 10.52%, 1.44%, and 25.86%, respectively. Additionally, we fused LAMNER’s code representation with the baseline models, and the fused models consistently showed improvement over the nonfused models. The human evaluation further shows that LAMNER produces high-quality code comments. + +	summarization documentation language model types representation
2022	Using Deep Learning to Generate Complete Log Statements + + + + +	Antonio Mastropaolo, Luca Pascarella, Gabriele Bavota		Logging is a practice widely adopted in several phases of the software lifecycle. For example, during software development log statements allow engineers to verify and debug the system by exposing fine-grained information of the running software. While the benefits of logging are undisputed, taking proper decisions about where to inject log statements, what information to log, and at which log level (e.g., error, warning) is crucial for the logging effectiveness. In this paper, we present LANCE (Log stAtemeNt reCommEnder), the first approach supporting developers in all these decisions. LANCE features a Text-To-Text-Transfer-Transformer (T5) model that has been trained on 6,894,456 Java methods. LANCE takes as input a Java method and injects in it a full log statement, including a human-comprehensible logging message and properly choosing the needed log level and the statement location. Our results show that LANCE is able to (i) properly identify the location in the code where to inject the statement in 65.9% of Java methods requiring it; (ii) selecting the proper log level in 66.2% of cases; and (iii) generate a completely correct log statement including a meaningful logging message in 15.2% of cases. +	Transformer logging
2022	All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs + + + + +	Vitaliy Bibaev, Alexey Kalina, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, Timofey Bryksin	ESEC/FSE	We propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine learning based model for ranking completion candidates. +We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. +We used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model. +Then, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE. +Our evaluation shows that using a simple ranking model trained on the past user behavior logs significantly improved code completion experience. +Compared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832. +The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client’s side. +Importantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models - this way, we have been using it in production since the end of 2020. +	autocomplete
2022	Static Prediction of Runtime Errors by Learning to Execute Programs with External Resource Descriptions + + + + +	David Bieber, Rishab Goel, Daniel Zheng, Hugo Larochelle, Daniel Tarlow		The execution behavior of a program often depends on external resources, such as program inputs or file contents, and so cannot be run in isolation. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors as early as possible, even before programs can be compiled and run. This presents an interesting machine learning challenge: can we predict runtime errors in a “static” setting, where program execution is not possible? Here, we introduce a real-world dataset and task for predicting runtime errors, which we show is difficult for generic models like Transformers. We approach this task by developing an interpreter-inspired architecture with an inductive bias towards mimicking program executions, which models exception handling and “learns to execute” descriptions of the contents of external resources. Surprisingly, we show that the model can also predict the location of the error, despite being trained only on labels indicating the presence/absence and kind of error. In total, we present a practical and difficult-yet-approachable challenge problem related to learning program execution and we demonstrate promising new capabilities of interpreter-inspired machine learning models for code. +	dataset defect
2022	Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code + + + + +	Patrick Bareiß, Beatriz Souza, Marcelo d'Amorim, Michael Pradel		Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises the question whether they could serve as a basis for building a wide range code generation tools. Traditionally, such tools are built manually and separately for each task. Instead, few-shot learning may allow to obtain different tools from a single pre-trained language model by simply providing a few examples or a natural language description of the expected tool behavior. This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose. We consider three code manipulation and code generation tasks targeted by a range of traditional tools: (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation. For each task, we compare few-shot learning to a manually built tool. Our results show that the model-based tools complement (code mutation), are on par (test oracle generation), or even outperform their respective traditionally built tool (test case generation), while imposing far less effort to develop them. By comparing the effectiveness of different variants of the model-based tools, we provide insights on how to design an appropriate input (“prompt”) to the model and what influence the size of the model has. For example, we find that providing a small natural language description of the code generation task is an easy way to improve predictions. Overall, we conclude that few-shot language models are surprisingly effective, yet there is still more work to be done, such as exploring more diverse ways of prompting and tackling even more involved tasks. +	Transformer
2022	Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities + + + + +	Francesco Barchi, Emanuele Parisi, Andrea Bartolini, Andrea Acquaviva		To cope with the increasing complexity of digital systems programming, deep learning techniques have recently been proposed to enhance software deployment by analysing source code for different purposes, ranging from performance and energy improvement to debugging and security assessment. As embedded platforms for cyber-physical systems are characterised by increasing heterogeneity and parallelism, one of the most challenging and specific problems is efficiently allocating computational kernels to available hardware resources. In this field, deep learning applied to source code can be a key enabler to face this complexity. However, due to the rapid development of such techniques, it is not easy to understand which of those are suitable and most promising for this class of systems. For this purpose, we discuss recent developments in deep learning for source code analysis, and focus on techniques for kernel mapping on heterogeneous platforms, highlighting recent results, challenges and opportunities for their applications to cyber-physical systems. +	optimization review
2022	Grounded Copilot: How Programmers Interact with Code-Generating Models + + + + +	Shraddha Barke, Michael B. James, Nadia Polikarpova		Powered by recent advances in code-generating models, AI assistants like Github Copilot promise to change the face of programming forever. But what is this new face of programming? We present the first grounded theory analysis of how programmers interact with Copilot, based on observing 20 participants–with a range of prior experience using the assistant–as they solve diverse programming tasks across four languages. Our main finding is that interactions with programming assistants are bimodal: in acceleration mode, the programmer knows what to do next and uses Copilot to get there faster; in exploration mode, the programmer is unsure how to proceed and uses Copilot to explore their options. Based on our theory, we provide recommendations for improving the usability of future AI programming assistants. +	human evaluation synthesis
2022	SantaCoder: don’t reach for the stars! + + + + +	Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muenninghoff, Mayank Mishra, Alex Gu, Manan Den, Longesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Terry Yue Zhuo, Francesco De Toni, Bernanrdo Garcia del Rio, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Michael Lappert, Ian Yu, Paulo Villegas, Jia Li, David Lansy, Huu Nguyen, Danish Contractor, Luis Villa, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Arjun Guha, Harm de Vries, Leonadro von Werra		The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code.1 This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) +redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, +JavaScript, and Python subsets of The Stack (Kocetkov et al., 2022) and +evaluate the models on MultiPL-E (Cassano et al., 2022), a text2code +benchmark available in 18 programming languages. We find that more +aggressive filtering of near-duplicates can further boost performance and, +surprisingly, that selecting files from repositories with 5+ GitHub stars +deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and +CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the +Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL +license at https://hf.co/bigcode +	Transformer
2022	Efficient Training of Language Models to Fill in the Middle + + + + +	Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, Mark Chen		We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research. +	Transformer language model
2022	Learning code summarization from a small and local dataset + + + + +	Toufique Ahmed, Premkumar Devanbu		Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary, and other phenomena vary substantially with each project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting) and a maximalist hybrid approach, fine-tuning first on many projects in many languages and then training on the same-project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art, on many different projects in both Java and Python. +	Transformer summarization
2022	DocCoder: Generating Code by Retrieving and Reading Docs + + + + +	Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao JIang, Graham Neubig		Natural-language-to-code models learn to generate a code snippet given a natural language (NL) intent. However, the rapid growth of both publicly available and proprietary libraries and functions makes it impossible to cover all APIs using training examples, as new libraries and functions are introduced daily. Thus, existing models inherently cannot generalize to using unseen functions and libraries merely through incorporating them into the training data. In contrast, when human programmers write programs, they frequently refer to textual resources such as code manuals, documentation, and tutorials, to explore and understand available library functionality. Inspired by this observation, we introduce DocCoder: an approach that explicitly leverages code manuals and documentation by (1) retrieving the relevant documentation given the NL intent, and (2) generating the code based on the NL intent and the retrieved documentation. Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model. We demonstrate that DocCoder consistently improves NL-to-code models: DocCoder achieves 11x higher exact match accuracy than strong baselines on a new Bash dataset tldr; on the popular Python CoNaLa benchmark, DocCoder improves over strong baselines by 1.65 BLEU. +	Transformer search code generation
2021	CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation + + + + +	Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu		Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems. +	benchmark Transformer
2021	Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors + + + + +	Junayed Mahmud, Fahim Faisal, Raihan Islam Arnob, Antonios Anastasopoulos, Kevin Moran	NLP4Prog	Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to “translate” code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an error taxonomy that can be used to drive future research efforts. +	survey summarization Transformer
2021	Language-Agnostic Representation Learning of Source Code from Structure and Context + + + + +	Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, Stephan Günnemann	ICLR	Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code. +	Transformer representation
2021	Shellcode_IA32: A Dataset for Automatic Shellcode Generation + + + + +	Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella, Bojan Cukic, Samira Shaikh	NLP4Prog	We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability, starting from natural language comments. We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions. We experiment with standard methods in neural machine translation (NMT) to establish baseline performance levels on this task. +	code generation dataset
2021	Toward Less Hidden Cost of Code Completion with Acceptance and Ranking Models + + + + +	Jingxuan Li, Rui Huang, Wei Li, Kai Yao, Weiguo Tan	ICSME	Code completion is widely used by software developers to provide coding suggestions given a partially written code snippet. Apart from the traditional code completion methods, which only support single token completion at minimal positions, recent studies show the ability to provide longer code completion at more flexible positions. However, such frequently triggered and longer completion results reduce the overall precision as they generate more invalid results. Moreover, different studies are mostly incompatible with each other. Thus, it is vital to develop an ensemble framework that can combine results from multiple models to draw merits and offset defects of each model. +This paper conducts a coding simulation to collect data from code context and different code completion models and then apply the data in two tasks. First, we introduce an acceptance model which can dynamically control whether to display completion results to the developer. It uses simulation features to predict whether correct results exist in the output of these models. Our best model reduces the percentage of false-positive completion from 55.09% to 17.44%. Second, we design a fusion ranking scheme that can automatically identify the priority of the completion results and reorder the candidates from multiple code completion models. This scheme is flexible in dealing with various models, regardless of the type or the length of their completion results. We integrate this ranking scheme with two frequency models and a GPT-2 styled language model, along with the acceptance model to yield 27.80% and 37.64% increase in TOP1 and TOP5 accuracy, respectively. In addition, we propose a new code completion evaluation metric, Benefit-Cost Ratio(BCR), taking into account the benefit of keystrokes saving and hidden cost of completion list browsing, which is closer to real coder experience scenario. +	autocomplete language model optimization Transformer
2021	Learning to Extend Program Graphs to Work-in-Progress Code + + + + +	Xuechen Li, Chris J. Maddison, Daniel Tarlow		Source code spends most of its time in a broken or incomplete state during software development. This presents a challenge to machine learning for code, since high-performing models typically rely on graph structured representations of programs derived from traditional program analyses. Such analyses may be undefined for broken or incomplete code. We extend the notion of program graphs to work-in-progress code by learning to predict edge relations between tokens, training on well-formed code before transferring to work-in-progress code. We consider the tasks of code completion and localizing and repairing variable misuse in a work-in-process scenario. We demonstrate that training relation-aware models with fine-tuned edges consistently leads to improved performance on both tasks. +	Transformer autocomplete repair
2021	Disentangled Code Representation Learning for Multiple Programming Languages + + + + +	Jingfeng Zhang, Haiwen Hong, Yin Zhang, Yao Wan, Ye Liu, Yulei Sui	ACL	Developing effective distributed representations of source code is fundamental yet challenging for many software engineering tasks such as code clone detection, code search, code translation and transformation. However, current code embedding approaches that represent the semantic and syntax of code in a mixed way are less interpretable and the resulting embedding can not be easily generalized across programming languages. In this paper, we propose a disentangled code representation learning approach to separate the semantic from the syntax of source code under a multi-programming-language setting, obtaining better interpretability and generalizability. Specially, we design three losses dedicated to the characteristics of source code to enforce the disentanglement effectively. We conduct comprehensive experiments on a real-world dataset composed of programming exercises implemented by multiple solutions that are semantically identical but grammatically distinguished. The experimental results validate the superiority of our proposed disentangled code representation, compared to several baselines, across three types of downstream tasks, i.e., code clone detection, code translation, and code-to-code search. +	representation
2021	Capturing Structural Locality in Non-parametric Language Models + + + + +	Frank F. Xu, Junxian He, Graham Neubig, Vincent J. Hellendoorn		Structural locality is a ubiquitous feature of real-world datasets, wherein data points are organized into local hierarchies. Some examples include topical clusters in text or project hierarchies in source code repositories. In this paper, we explore utilizing this structural locality within non-parametric language models, which generate sequences that reference retrieved examples from an external source. We propose a simple yet effective approach for adding locality information into such models by adding learned parameters that improve the likelihood of retrieving examples from local neighborhoods. Experiments on two different domains, Java source code and Wikipedia text, demonstrate that locality features improve model efficacy over models without access to these features, with interesting differences. We also perform an analysis of how and where locality features contribute to improved performance and why the traditionally used contextual similarity metrics alone are not enough to grasp the locality structure. +	language model
2021	Co-Training for Commit Classification + + + + +	Jian Yi, David Lee, Hai Leong Chieu	EMNLP WNUT	Commits in version control systems (e.g. Git) track changes in a software project. Commits comprise noisy user-generated natural language and code patches. Automatic commit classification (CC) has been used to determine the type of code maintenance activities performed, as well as to detect bug fixes in code repositories. Much prior work occurs in the fully-supervised setting – a setting that can be a stretch in resource-scarce situations presenting difficulties in labeling commits. In this paper, we apply co-training, a semi-supervised learning method, to take advantage of the two views available – the commit message (natural language) and the code changes (programming language) – to improve commit classification. +	Transformer bimodal defect
2021	Energy-Based Models for Code Generation under Compilability Constraints + + + + +	Tomasz Korbak, Hady Elsahar, Marc Dymetman, Germán Kruszewski	ACL	Neural language models can be successfully trained on source code, leading to applications such as code completion. However, their versatile autoregressive self-supervision objective overlooks important global sequence-level features that are present in the data such as syntactic correctness or compilability. In this work, we pose the problem of learning to generate compilable code as constraint satisfaction. We define an Energy-Based Model (EBM) representing a pre-trained generative model with an imposed constraint of generating only compilable sequences. We then use the KL-Adaptive Distributional Policy Gradient algorithm (Khalifa et al., 2021) to train a generative model approximating the EBM. We conduct experiments showing that our proposed approach is able to improve compilability rates without sacrificing diversity and complexity of the generated samples. +	code generation
2021	PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code + + + + +	Egor Spirin, Egor Bogomolov, Vladimir Kovalenko, Timofey Bryksin	MSR	The application of machine learning algorithms to source code has grown in the past years. Since these algorithms are quite sensitive to input data, it is not surprising that researchers experiment with input representations. Nowadays, a popular starting point to represent code is abstract syntax trees (ASTs). Abstract syntax trees have been used for a long time in various software engineering domains, and in particular in IDEs. The API of modern IDEs allows to manipulate and traverse ASTs, resolve references between code elements, etc. Such algorithms can enrich ASTs with new data and therefore may be useful in ML-based code analysis. In this work, we present PSIMINER— a tool for processing PSI trees from the IntelliJ Platform. PSI trees contain code syntax trees as well as functions to work with them, and therefore can be used to enrich code representation using static analysis algorithms of modern IDEs. To showcase this idea, we use our tool to infer types of identifiers in Java ASTs and extend the code2seq model for the method name prediction problem. +	tool
2021	What do pre-trained code models know about code? + + + + +	Anjan Karmakar, Romain Robbes		Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation, code summarization, among others. However, whether the vector representations from these pre-trained models comprehensively encode characteristics of source code well enough to be applicable to a broad spectrum of downstream tasks remains an open question. + + One way to investigate this is with diagnostic tasks called probes. In this paper, we construct four probing tasks (probing for surface-level, syntactic, structural, and semantic information) for pre-trained code models. We show how probes can be used to identify whether models are deficient in (understanding) certain code properties, characterize different model layers, and get insight into the model sample-efficiency. + + We probe four models that vary in their expected knowledge of code properties: BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code, and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow). While GraphCodeBERT performs more consistently overall, we find that BERT performs surprisingly well on some code tasks, which calls for further investigation. +	Transformer
2021	IdBench: Evaluating Semantic Representations of Identifier Names in Source Code + + + + +	Yaza Wainakh, Moiz Rauf, Michael Pradel	ICSE	Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of namebased analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation. +	representation
2021	TreeBERT: A Tree-Based Pre-Trained Model for Programming Language + + + + +	Xue Jiang, Zhuoran Zheng, Chen Lyu, Liang Li, Lei Lyu	UAI	Source code can be parsed into the abstract syntax tree (AST) based on defined syntax rules. However, in pre-training, little work has considered the incorporation of tree structure into the learning process. In this paper, we present TreeBERT, a tree-based pre-trained model for improving programming language-oriented generation tasks. To utilize tree structure, TreeBERT represents the AST corresponding to the code as a set of composition paths and introduces node position embedding. The model is trained by tree masked language modeling (TMLM) and node order prediction (NOP) with a hybrid objective. TMLM uses a novel masking strategy designed according to the tree’s characteristics to help the model understand the AST and infer the missing semantics of the AST. With NOP, TreeBERT extracts the syntactical structure by learning the order constraints of nodes in AST. We pre-trained TreeBERT on datasets covering multiple programming languages. On code summarization and code documentation tasks, TreeBERT outperforms other pre-trained models and state-of-the-art models designed for these tasks. Furthermore, TreeBERT performs well when transferred to the pre-trained unseen programming language. +	grammar Transformer
2021	CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model + + + + +	Tae Hwan Jung	NLP4Prog	Commit message is a document that summarizes source code changes in natural language. A good commit message clearly shows the source code changes, so this enhances collaboration between developers. Therefore, our work is to develop a model that automatically writes the commit message. To this end, we release 345K datasets consisting of code modification and commit messages in six programming languages (Python, PHP, Go, Java, JavaScript, and Ruby). Similar to the neural machine translation (NMT) model, using our dataset, we feed the code modification to the encoder input and the commit message to the decoder input and measure the result of the generated commit message with BLEU-4. Also, we propose the following two training methods to improve the result of generating the commit message: (1) A method of preprocessing the input to feed the code modification to the encoder input. (2) A method that uses an initial weight suitable for the code domain to reduce the gap in contextual representation between programming language (PL) and natural language (NL). +	dataset language model Transformer
2021	SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation + + + + +	Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, Xin Jiang		Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models for source code (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, code clone detection, and program translation. Current approaches typically consider the source code as a plain sequence of tokens, or inject the structure information (e.g., AST and data-flow) into the sequential model pre-training. To further explore the properties of programming languages, this paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations. Specially, we design two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP), which are designed to predict identifiers, and edges between two nodes of AST, respectively. Meanwhile, to exploit the complementary information in semantically equivalent modalities (i.e., code, comment, AST) of the code, we propose a multi-modal contrastive learning strategy to maximize the mutual information among different modalities. Extensive experiments on four downstream tasks related to code intelligence show that SynCoBERT advances the state-of-the-art with the same pre-training corpus and model size. +	pretraining
2021	Learning Type Annotation: Is Big Data Enough? + + + + +	Kevin Jesse, Premkumar Devanbu, Toufique Ahmed	FSE	TypeScript is a widely used optionally-typed language where developers can adopt “pay as you go” typing: they can add types as +desired, and benefit from static typing. The “type annotation tax” +or manual effort required to annotate new or existing TypeScript +can be reduced by a variety of automatic methods. Probabilistic +machine-learning (ML) approaches work quite well. ML approaches +use different inductive biases, ranging from simple token sequences +to complex graphical neural network (GNN) models capturing syntax and semantic relations. More sophisticated inductive biases are +hand-engineered to exploit the formal nature of software. Rather +than deploying fancy inductive biases for code, can we just use “big +data” to learn natural patterns relevant to typing? We find evidence +suggesting that this is the case. We present TypeBert, demonstrating that even with simple token-sequence inductive bias used in +BERT-style models and enough data, type-annotation performance +of the most sophisticated models can be surpassed. +	Transformer types
2021	Multimodal Representation for Neural Code Search + + + + +	Jian Gu, Zimin Chen, Martin Monperrus	ICSME	Semantic code search is about finding semantically relevant code snippets for a given natural language query. In the state-of-the-art approaches, the semantic similarity between code and query is quantified as the distance of their representation in the shared vector space. In this paper, to improve the vector space, we introduce tree-serialization methods on a simplified form of AST and build the multimodal representation for the code data. We conduct extensive experiments using a single corpus that is large-scale and multi-language: CodeSearchNet. Our results show that both our tree-serialized representations and multimodal learning model improve the performance of code search. Last, we define intuitive quantification metrics oriented to the completeness of semantic and syntactic information of the code data, to help understand the experimental findings. +	search representation
2021	Fix-Filter-Fix: Intuitively Connect Any Models for Effective Bug Fixing + + + + +	Haiwen Hong, Jingfeng Zhang, Yin Zhang, Yao Wan, Yulei Sui	EMNLP	Locating and fixing bugs is a time-consuming task. Most neural machine translation (NMT) based approaches for automatically bug fixing lack generality and do not make full use of the rich information in the source code. In NMT-based bug fixing, we find some predicted code identical to the input buggy code (called unchanged fix) in NMT-based approaches due to high similarity between buggy and fixed code (e.g., the difference may only appear in one particular line). Obviously, unchanged fix is not the correct fix because it is the same as the buggy code that needs to be fixed. Based on these, we propose an intuitive yet effective general framework (called Fix-Filter-Fix or Fˆ3) for bug fixing. Fˆ3 connects models with our filter mechanism to filter out the last model’s unchanged fix to the next. We propose an Fˆ3 theory that can quantitatively and accurately calculate the Fˆ3 lifting effect. To evaluate, we implement the Seq2Seq Transformer (ST) and the AST2Seq Transformer (AT) to form some basic Fˆ3 instances, called Fˆ3_ST+AT and Fˆ3_AT+ST. Comparing them with single model approaches and many model connection baselines across four datasets validates the effectiveness and generality of Fˆ3 and corroborates our findings and methodology. +	repair
2021	CoSQA: 20,000+ Web Queries for Code Search and Question Answering + + + + +	Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, Nan Duan	ACL	Finding codes given natural language query is beneficial to the productivity of software developers. +Future progress towards better semantic matching between query and code requires richer supervised training resources. +To remedy this, we introduce the CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and codes, +each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance query-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%. +	dataset search
2021	Learning to Find Naming Issues with Big Code and Small Supervision + + + + +	Jingxuan He, Cheng-Chun Lee, Veselin Raychev, Martin Vechev	PLDI	We introduce a new approach for finding and fixing naming +issues in source code. The method is based on a careful +combination of unsupervised and supervised procedures: (i) +unsupervised mining of patterns from Big Code that express +common naming idioms. Program fragments violating such +idioms indicates likely naming issues, and (ii) supervised +learning of a classifier on a small labeled dataset which filters +potential false positives from the violations. + + We implemented our method in a system called +Namer and evaluated it on a large number of Python and Java programs. +We demonstrate that Namer is effective in finding naming mistakes +in real world repositories with high precision (∼70%). +Perhaps surprisingly, we also show that existing deep learning methods +are not practically effective and achieve low precision in finding naming issues (up to ∼16%). +	repair
2021	Mining Idioms in the Wild + + + + +	Aishwarya Sivaraman, Rui Abreu, Andrew Scott, Tobi Akomolede, Satish Chandra		Existing code repositories contain numerous instances of code patterns that are idiomatic ways of accomplishing a particular programming task. Sometimes, the programming language in use supports specific operators or APIs that can express the same idiomatic imperative code much more succinctly. However, those code patterns linger in repositories because the developers may be unaware of the new APIs or have not gotten around to them. Detection of idiomatic code can also point to the need for new APIs. + + We share our experiences in mine idiomatic patterns from the Hack repo at Facebook. We found that existing techniques either cannot identify meaningful patterns from syntax trees or require test-suite-based dynamic analysis to incorporate semantic properties to mine useful patterns. The key insight of the approach proposed in this paper – Jezero – is that semantic idioms from a large codebase can be learned from canonicalized dataflow trees. We propose a scalable, lightweight static analysis-based approach to construct such a tree that is well suited to mine semantic idioms using nonparametric Bayesian methods. + + Our experiments with Jezero on Hack code shows a clear advantage of adding canonicalized dataflow information to ASTs: Jezero was significantly more effective than a baseline that did not have the dataflow augmentation in being able to effectively find refactoring opportunities from unannotated legacy code. +	pattern mining refactoring
2021	On the Naturalness and Localness of Software Logs + + + + +	Sina Gholamian, Paul A. S. Ward		Logs are an essential part of the development and +maintenance of large and complex software systems as they +contain rich information pertaining to the dynamic content and +state of the system. As such, developers and practitioners rely +heavily on the logs to monitor their systems. In parallel, the +increasing volume and scale of the logs, due to the growing +complexity of modern software systems, renders the traditional +way of manual log inspection insurmountable. Consequently, to +handle large volumes of logs efficiently and effectively, various +prior research aims to automate the analysis of log files. Thus, in +this paper, we begin with the hypothesis that log files are natural +and local and these attributes can be applied for automating log +analysis tasks. We guide our research with six research questions +with regards to the naturalness and localness of the log files, and +present a case study on anomaly detection and introduce a tool +for anomaly detection, called ANALOG, to demonstrate how our +new findings facilitate the automated analysis of logs. +	logging language model
2021	CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing + + + + +	Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, Burkhard Rost		Currently, a growing number of mature natural language processing applications make people’s life more convenient. Such applications are built by source code - the language in software engineering. However, the applications for understanding source code language to ease the software engineering process are under-researched. Simultaneously, the transformer model, especially its combination with transfer learning, has been proven to be a powerful technique for natural language processing tasks. These breakthroughs point out a promising direction for process source code and crack software engineering tasks. This paper describes CodeTrans - an encoder-decoder transformer model for tasks in the software engineering domain, that explores the effectiveness of encoder-decoder transformer models for six software engineering tasks, including thirteen sub-tasks. Moreover, we have investigated the effect of different training strategies, including single-task learning, transfer learning, multi-task learning, and multi-task learning with fine-tuning. CodeTrans outperforms the state-of-the-art models on all the tasks. To expedite future works in the software engineering domain, we have published our pre-trained models of CodeTrans. +	Transformer
2021	DreamCoder: bootstrapping inductive program synthesis with wake-sleep library learning + + + + +	Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, Joshua B. Tenenbaum	42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI 2021)	We present a system for inductive program synthesis called DreamCoder, which inputs a corpus of synthesis problems each specified by one or a few examples, and automatically derives a library of program components and a neural search policy that can be used to efficiently solve other similar synthesis problems. The library and search policy bootstrap each other iteratively through a variant of “wake-sleep” approximate Bayesian learning. A new refactoring algorithm based on E-graph matching identifies common sub-components across synthesized programs, building a progressively deepening library of abstractions capturing the structure of the input domain. We evaluate on eight domains including classic program synthesis areas and AI tasks such as planning, inverse graphics, and equation discovery. We show that jointly learning the library and neural search policy leads to solving more problems, and solving them more quickly. +	synthesis search
2021	Generating Bug-Fixes Using Pretrained Transformers + + + + +	Dawn Drain, Chen Wu, Alexey Svyatkovskiy, Neel Sundaresan		Detecting and fixing bugs are two of the most important yet frustrating parts of the software development cycle. Existing bug detection tools are based mainly on static analyzers, which rely on mathematical logic and symbolic reasoning about the program execution to detect common types of bugs. Fixing bugs is typically left out to the developer. In this work we introduce DeepDebug: a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories. We frame bug-patching as a sequence-to-sequence learning task consisting of two steps: (i) denoising pretraining, and (ii) supervised finetuning on the target translation task. We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch, while domain-adaptive pretraining from natural language to code further improves the accuracy by another 32%. We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art. In contrast to prior work, we attain our best results when generating raw code, as opposed to working with abstracted code that tends to only benefit smaller capacity models. Finally, we observe a subtle improvement from adding syntax embeddings along with the standard positional embeddings, as well as with adding an auxiliary task to predict each token’s syntactic class. Despite focusing on Java, our approach is language agnostic, requiring only a general-purpose parser such as tree-sitter. +	Transformer repair
2021	DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons + + + + +	Dawn Drain, Colin B. Clement, Guillermo Serrato, Neel Sundaresan		The joint task of bug localization and program repair is an integral part of the software development process. In this work we present DeepDebug, an approach to automated debugging using large, pretrained transformers. We begin by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs. We apply these synthetic bugs toward two ends. First, we directly train a backtranslation model on all functions from 200K repositories. Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions in those repositories that are covered by passing tests. This provides us with rich debugging information such as stack traces and print statements, which we use to finetune our model which was pretrained on raw source code. Finally, we strengthen all our models by expanding the context window beyond the buggy function itself, and adding a skeleton consisting of that function’s parent class, imports, signatures, docstrings, and method bodies, in order of priority. On the QuixBugs benchmark, we increase the total number of fixes found by over 50%, while also decreasing the false positive rate from 35% to 5% and decreasing the timeout from six hours to one minute. On our own benchmark of executable tests, our model fixes 68% of all bugs on its first attempt without using traces, and after adding traces it fixes 75% on first attempt. We will open-source our framework and validation set for evaluating on executable tests. +	repair Transformer
2021	Contrastive Learning for Source Code with Structural and Functional Properties + + + + +	Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty		Pre-trained transformer models have recently shown promises for understanding the source code. Most existing works expect to understand code from the textual features and limited structural knowledge of code. However, the program functionalities sometimes cannot be fully revealed by the code sequence, even with structure information. Programs can contain very different tokens and structures while sharing the same functionality, but changing only one or a few code tokens can introduce unexpected or malicious program behaviors while preserving the syntax and most tokens. In this work, we present BOOST, a novel self-supervised model to focus pre-training based on the characteristics of source code. We first employ automated, structure-guided code transformation algorithms that generate (i.) functionally equivalent code that looks drastically different from the original one, and (ii.) textually and syntactically very similar code that is functionally distinct from the original. We train our model in a way that brings the functionally equivalent code closer and distinct code further through a contrastive learning objective. To encode the structure information, we introduce a new node-type masked language model objective that helps the model learn about structural context. We pre-train BOOST with a much smaller dataset than the state-of-the-art models, but our small models can still match or outperform these large models in code understanding and generation tasks. +	representation pretraining Transformer
2021	Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data + + + + +	Moshe Hazoom, Vibhor Malik, Ben Bogin	NLP4Prog	Most available semantic parsing datasets, comprising of pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluation of natural language understanding systems. As a result, they do not contain any of the richness and variety of natural-occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE, a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset, propose an evaluation metric based on comparison of partial query clauses that is more suitable for real-world queries, and conduct experiments with strong baselines, showing a large gap between the performance on SEDE compared to other common datasets. +	dataset
2021	Neural Program Repair with Execution-based Backpropagation + + + + +	He Ye, Matias Martinez, Monperrus Martin		Neural machine translation (NMT) architectures have achieved promising results for automatic program repair. Yet, they have the limitation of generating low-quality patches (e.g., not compilable patches). This is because the existing works only optimize a purely syntactic loss function based on characters and tokens without incorporating program-specific information during neural net weight optimization. In this paper, we propose a novel program repair model called RewardRepair. The core novelty of RewardRepair is to improve NMT-based program repair with a loss function based on program compilation and test execution information, rewarding the network to produce patches that compile and that do not overfit. We conduct several experiments to evaluate RewardRepair showing that it is feasible and effective to use compilation and test execution results to optimize the underlying neural repair model. In total, RewardRepair correctly repairs 43 Defects4J bugs including eight that are fixed for the first time. +	repair
2021	DeepMerge: Learning to Merge Programs + + + + +	Elizabeth Dinella, Todd Mytkowicz, Alexey Svyatkovskiy, Christian Bird, Mayur Naik, Shuvendu K. Lahiri		Program merging is ubiquitous in modern software development. Although commonly used in most version control systems, text-based merge algorithms are prone to producing spurious merge conflicts: they report a conflict even when program changes do not interfere with each other semantically. Spurious merge conflicts are costly to development as the need for manual intervention stalls modern continuous integration pipelines. We propose a novel data-driven approach to identify and resolve spurious merge conflicts with a sequence-to-sequence machine learning model. We realize our approach in a tool DeepMerge that uses a novel combination of (i) an edit-aware embedding of merge inputs and (ii) a variation of pointer networks to construct resolutions from input segments. We also propose an algorithm to extract ground truth manual resolutions from a code corpus and employ it to curate a dataset comprising 10,729 non-trivial resolutions in Javascript programs. Our evaluation shows that DeepMerge can predict correct resolutions with high precision (72%) and modest recall (34%) on the dataset overall, and high recall (78%) on merges comprising of upto 3 lines that comprise 24% of the dataset. +	edit repair
2021	MulCode: A Multi-task Learning Approach for Source Code Understanding + + + + +	Deze Wang, Yue Yu, Shanshan Li, Wei Dong, Ji Wang, Liao Qing	SANER	Recent years have witnessed the significant rise of Deep Learning (DL) techniques applied to source code. Researchers exploit DL for a multitude of tasks and achieve impressive results. However, most tasks are explored separately, resulting in a lack of generalization of the solutions. In this work, we propose MulCode, a multi-task learning approach for source code understanding that learns unified representation space for tasks, with the pre-trained BERT model for the token sequence and the Tree-LSTM model for abstract syntax trees. Furthermore, we integrate two source code views into a hybrid representation via the attention mechanism and set learnable uncertainty parameters to adjust the tasks’ relationship. We train and evaluate MulCode in three downstream tasks: comment classification, author attribution, and duplicate function detection. In all tasks, MulCode outperforms the state-of-theart techniques. Moreover, experiments on three unseen tasks demonstrate the generalization ability of MulCode compared with state-of-the-art embedding methods. +	representation
2021	A Syntax-Guided Edit Decoder for Neural Program Repair + + + + +	Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, Lu Zhang	FSE	Automated Program Repair (APR) helps improve the efficiency of software development and maintenance. Recent APR techniques use deep learning, particularly the encoder-decoder architecture, to generate patches. +Though existing DL-based APR approaches have proposed different encoder architectures, the decoder remains to be the standard one, which generates a sequence of tokens one by one to replace the faulty statement. +This decoder has multiple limitations: 1) allowing to generate syntactically incorrect programs, 2) inefficiently representing small edits, and 3) not being able to generate project-specific identifiers. +In this paper, we propose Recoder, a syntax-guided edit decoder with placeholder generation. Recoder is novel in multiple aspects: 1) Recoder generates edits rather than modified code, allowing efficient representation of small edits; 2) Recoder is syntax-guided, with the novel provider/decider architecture to ensure the syntactic correctness of the patched program and accurate generation; 3) Recoder generates placeholders that could be instantiated as project-specific identifiers later. +We conduct experiments to evaluate Recoder on 395 bugs from Defects4J v1.2, 420 additional bugs from Defects4J v2.0, 297 bugs from IntroClassJava and 40 bugs from QuixBugs. Our results show that Recoder repairs 53 bugs on Defects4J v1.2, which achieves 26.2% (11 bugs) improvement over the previous state-of-the-art approach for single-hunk bugs (TBar). Importantly, to our knowledge, Recoder is the first DL-based APR approach that has outperformed the traditional APR approaches on this benchmark. +	edit
2021	Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy + + + + +	Colin B. Clement, Shuai Lu, Xiaoyu Liu, Michele Tufano, Dawn Drain, Nan Duan, Neel Sundaresan, Alexey Svyatkovskiy		Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they will be unable to leverage the entire relevant context of large files and packages for any given task. While there are many efforts to extend the context window, we introduce an architecture-independent approach for leveraging the syntactic hierarchies of source code for incorporating entire file-level context into a fixed-length window. Using concrete syntax trees of each source file we extract syntactic hierarchies and integrate them into context window by selectively removing from view more specific, less relevant scopes for a given task. We evaluate this approach on code generation tasks and joint translation of natural language and source code in Python programming language, achieving a new state-of-the-art in code completion and summarization for Python in the CodeXGLUE benchmark. We also introduce new CodeXGLUE benchmarks for user-experience-motivated tasks: code completion with normalized literals, method body completion/code summarization conditioned on file-level context. +	Transformer language model code generation
2021	Distilling Transformers for Neural Cross-Domain Search + + + + +	Colin B. Clement, Chen Wu, Dawn Drain, Neel Sundaresan		Pre-trained transformers have recently clinched top spots in the gamut of natural language tasks and pioneered solutions to software engineering tasks. Even information retrieval has not been immune to the charm of the transformer, though their large size and cost is generally a barrier to deployment. While there has been much work in streamlining, caching, and modifying transformer architectures for production, here we explore a new direction: distilling a large pre-trained translation model into a lightweight bi-encoder which can be efficiently cached and queried. We argue from a probabilistic perspective that sequence-to-sequence models are a conceptually ideal—albeit highly impractical—retriever. We derive a new distillation objective, implementing it as a data augmentation scheme. Using natural language source code search as a case study for cross-domain search, we demonstrate the validity of this idea by significantly improving upon the current leader of the CodeSearchNet challenge, a recent natural language code search benchmark. +	search Transformer
2021	On the Embeddings of Variables in Recurrent Neural Networks for Source Code + + + + +	Nadezhda Chirkova	NAACL	Source code processing heavily relies on the methods widely used in natural language processing (NLP), but involves specifics that need to be taken into account to achieve higher quality. An example of this specificity is that the semantics of a variable is defined not only by its name but also by the contexts in which the variable occurs. In this work, we develop dynamic embeddings, a recurrent mechanism that adjusts the learned semantics of the variable when it obtains more information about the variable’s role in the program. We show that using the proposed dynamic embeddings significantly improves the performance of the recurrent neural network, in code completion and bug fixing tasks. +	autocomplete
2021	Leveraging Language to Learn Program Abstractions and Search Heuristics + + + + +	Catherine Wong, Kevin Ellis, Joshua B. Tenenbaum, Jacob Andreas	Thirty-eighth International Conference on Machine Learning (ICML 2021)	Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains – string editing, image composition, and abstract reasoning about scenes – even when no natural language hints are available at test time. +	synthesis search
2021	Evaluating Large Language Models Trained on Code + + + + +	Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, Will Guss, Alex Nichol, Igor Babuschkin, Suchir Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba		We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics. +	language model synthesis
2021	PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair + + + + +	Zimin Chen, Vincent J Hellendoorn, Pascal Lamblin, Petros Maniatis, Pierre-Antoine Manzagol, Daniel Tarlow, Subhodeep Moitra	NeurIPS	Machine learning for understanding and editing source code has recently attracted significant interest, with many developments in new models, new code representations, and new tasks.This proliferation can appear disparate and disconnected, making each approach seemingly unique and incompatible, thus obscuring the core machine learning challenges and contributions.In this work, we demonstrate that the landscape can be significantly simplified by taking a general approach of mapping a graph to a sequence of tokens and pointers.Our main result is to show that 16 recently published tasks of different shapes can be cast in this form, based on which a single model architecture achieves near or above state-of-the-art results on nearly all tasks, outperforming custom models like code2seq and alternative generic models like Transformers.This unification further enables multi-task learning and a series of cross-cutting experiments about the importance of different modeling choices for code understanding and repair tasks.The full framework, called PLUR, is easily extensible to more tasks, and will be open-sourced (https://github.com/google-research/plur). +	repair
2021	On Multi-Modal Learning of Editing Source Code + + + + +	Saikat Chakraborty, Baishakhi Ray		In recent years, Neural Machine Translator (NMT) has shown promise in automatically editing source code. Typical NMT based code editor only considers the code that needs to be changed as input and suggests developers with a ranked list of patched code to choose from - where the correct one may not always be at the top of the list. While NMT based code editing systems generate a broad spectrum of plausible patches, the correct one depends on the developers’ requirement and often on the context where the patch is applied. Thus, if developers provide some hints, using natural language, or providing patch context, NMT models can benefit from them. As a proof of concept, in this research, we leverage three modalities of information: edit location, edit code context, commit messages (as a proxy of developers’ hint in natural language) to automatically generate edits with NMT models. To that end, we build MODIT, a multi-modal NMT based code editing engine. With in-depth investigation and analysis, we show that developers’ hint as an input modality can narrow the search space for patches and outperform state-of-the-art models to generate correctly patched code in top-1 position. +	Transformer edit
2021	Deep Learning based Vulnerability Detection: Are We There Yet? + + + + +	Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, Baishakhi Ray	TSE	Automated detection of software vulnerabilities is a fundamental problem in software security. Existing program analysis techniques either suffer from high false positives or false negatives. Recent progress in Deep Learning (DL) has resulted in a surge of interest in applying DL for automated vulnerability detection. Several recent studies have demonstrated promising results achieving an accuracy of up to 95% at detecting vulnerabilities. In this paper, we ask, “how well do the state-of-the-art DL-based techniques perform in a real-world vulnerability prediction scenario?”. To our surprise, we find that their performance drops by more than 50%. A systematic investigation of what causes such precipitous performance drop reveals that existing DL-based vulnerability prediction approaches suffer from challenges with the training data (e.g., data duplication, unrealistic distribution of vulnerable classes, etc.) and with the model choices (e.g., simple token-based models). As a result, these approaches often do not learn features related to the actual cause of the vulnerabilities. Instead, they learn unrelated artifacts from the dataset (e.g., specific variable/function names, etc.). Leveraging these empirical findings, we demonstrate how a more principled approach to data collection and model design, based on realistic settings of vulnerability prediction, can lead to better solutions. The resulting tools perform significantly better than the studied baseline: up to 33.57% boost in precision and 128.38% boost in recall compared to the best performing model in the literature. Overall, this paper elucidates existing DL-based vulnerability prediction systems’ potential issues and draws a roadmap for future DL-based vulnerability prediction research. In that spirit, we make available all the artifacts supporting our results: https://git.io/Jf6IA +	defect survey
2021	InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees + + + + +	Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang	ICSE	Building deep learning models on source code has found many successful software engineering applications, such as code search, code comment generation, bug detection, code migration, and so on. Current learning techniques, however, have a major drawback that these models are mostly trained on datasets labeled for particular downstream tasks, and code representations may not be suitable for other tasks. While some techniques produce representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. Although certain techniques generate representations from unlabeled code when applied to downstream tasks they are far from satisfactory. This paper proposes InferCode to overcome the limitation by adapting the self-supervised learning mechanism to build source code model. The key novelty lies in training code representations by predicting automatically identified subtrees from the context of the ASTs. Subtrees in ASTs are treated with InferCode as the labels for training code representations without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream tasks or code units. We trained an InferCode model instance using the Tree-based CNN as the encoder of a large set of Java code and applied it to downstream unsupervised tasks such as code clustering, code clone detection, cross-language code search or reused under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, ASTNN, higher performance results are achieved using our pre-trained InferCode model with a significant margin for most tasks including those involving different programming languages. +	representation
2021	Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations + + + + +	Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang	SIGIR	We propose Corder, a self-supervised contrastive learning framework for source code model. Corder is designed to alleviate the need of labeled data for code retrieval and code summarization tasks. The pre-trained model of Corder can be used in two ways: (1) it can produce vector representation of code which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in a fine-tuning process for tasks that might still require label data such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set of semantic-preserving transformation operators to generate code snippets that are syntactically diverse but semantically equivalent. Through extensive experiments, we have shown that the code models pretrained by Corder substantially outperform the other baselines for code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks. +	pretraining search
2021	CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation + + + + +	Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi	EMNLP	Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5 . +	Transformer
2021	TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer + + + + +	Berkay Berabi, Jingxuan He, Veselin Raychev, Martin Vechev	ICML	The problem of fixing errors in programs has attracted substantial interest over the years. The +key challenge for building an effective code fixing tool is to capture a wide range of errors and +meanwhile maintain high accuracy. In this paper, we address this challenge and present a new +learning-based system, called TFix. TFix works +directly on program text and phrases the problem of code fixing as a text-to-text task. In turn, +this enables it to leverage a powerful Transformer +based model pre-trained on natural language and +fine-tuned to generate code fixes (via a large, high-quality dataset obtained from GitHub commits). +TFix is not specific to a particular programming +language or class of defects and, in fact, improved +its precision by simultaneously fine-tuning on 52 +different error types reported by a popular static +analyzer. Our evaluation on a massive dataset of +JavaScript programs shows that TFix is practically +effective: it is able to synthesize code that fixes +the error in ∼67 percent of cases and significantly +outperforms existing learning-based approaches. +	repair
2021	Exploration of Convolutional Neural Network models for source code classification + + + + +	Francesco Barchi, Emanuele Parisi, Gianvito Urgese, Elisa Ficarra, Andrea Acquaviva		The application of Artificial Intelligence is becoming common in many engineering fields. Among them, one of the newest and rapidly evolving is software generation, where AI can be used to automatically optimise the implementation of an algorithm for a given computing platform. In particular, Deep Learning technologies can be used to the decide how to allocate pieces of code to hardware platforms with multiple cores and accelerators, that are common in high performance and edge computing applications. In this work, we explore the use of Convolutional Neural Networks (CNN)s to analyse the application source code and decide the best compute unit to minimise the execution time. We demonstrate that CNN models can be successfully applied to source code classification, providing higher accuracy with consistently reduced learning time with respect to state-of-the-art methods. Moreover, we show the robustness of the method with respect to source code pre-processing, compiler options and hyper-parameters selection. +	optimization static analysis program analysis language model
2021	Jointly Learning to Repair Code and Generate Commit Message + + + + +	Jiaqi Bai, Long Zhou, Ambrosio Blanco, Shujie Liu, Furu Wei, Ming Zhou, Zhoujun Li		We propose a novel task of jointly repairing program codes and generating commit messages. Code repair and commit message generation are two essential and related tasks for software development. However, existing work usually performs the two tasks independently. We construct a multilingual triple dataset including buggy code, fixed code, and commit messages for this novel task. We provide the cascaded models as baseline, which are enhanced with different training approaches, including the teacher-student method, the multi-task method, and the back-translation method. To deal with the error propagation problem of the cascaded method, the joint model is proposed that can both repair the code and generate the commit message in a unified framework. Experimental results show that the enhanced cascaded model with teacher-student method and multitask-learning method achieves the best score on different metrics of automated code repair, and the joint model behaves better than the cascaded model on commit message generation. +	edit Transformer
2021	Self-Supervised Bug Detection and Repair + + + + +	Miltiadis Allamanis, Henry Jackson-Flux, Marc Brockschmidt	NeurIPS	Machine learning-based program analyses have recently shown the promise of integrating formal and probabilistic reasoning towards aiding software development. However, in the absence of large annotated corpora, training these analyses is challenging. Towards addressing this, we present BugLab, an approach for self-supervised learning of bug detection and repair. BugLab co-trains two models: (1) a detector model that learns to detect and repair bugs in code, (2) a selector model that learns to create buggy code for the detector to use as training data. A Python implementation of BugLab improves by up to 30% upon baseline methods on a test dataset of 2374 real-life bugs and finds 19 previously unknown bugs in open-source software. +	GNN Transformer defect repair
2021	A large-scale benchmark for few-shot program induction and synthesis + + + + +	Ferran Alet, Javier Lopez-Contreras, James Koppel, Maxwell Nye, Armando Solar-Lezama, Tomas Lozano-Perez, Leslie Kaelbling, Joshua Tenenbaum	ICML	A landmark challenge for AI is to learn flexible, powerful representations from small numbers of examples. +On an important class of tasks, hypotheses in the form of programs provide extreme generalization capabilities from surprisingly few examples. However, whereas large natural few-shot learning image benchmarks have spurred progress in meta-learning for deep networks, there is no comparably big, natural program-synthesis dataset that can play a similar role. This is because, whereas images are relatively easy to label from internet meta-data or annotated by non-experts, generating meaningful input-output examples for program induction has proven hard to scale. In this work, we propose a new way of leveraging unit tests and natural inputs for small programs as meaningful input-output examples for each sub-program of the overall program. This allows us to create a large-scale naturalistic few-shot program-induction benchmark and propose new challenges in this domain. The evaluation of multiple program induction and synthesis algorithms points to shortcomings of current methods and suggests multiple avenues for future work. +	dataset synthesis
2021	Improving Code Autocompletion with Transfer Learning + + + + +	Wen Zhou, Seohyun Kim, Vijayaraghavan Murali, Gareth Ari Aye		Software language models have achieved promising results predicting code completion usages, and several industry studies have described successful IDE integrations. Recently, accuracy in autocompletion prediction improved 12.8% from training on a real-world dataset collected from programmers’ IDE activity. But what if limited examples of IDE autocompletion in the target programming language are available for model training? In this paper, we investigate the efficacy of pretraining autocompletion models on non-IDE, non-autocompletion, and different-language example code sequences. We find that these unsupervised pretrainings improve model accuracy by over 50% on very small fine-tuning datasets and over 10% on 50k labeled examples. We confirm the real-world impact of these pretrainings in an online setting through A/B testing on thousands of IDE autocompletion users, finding that pretraining is responsible for increases of up to 6.63% autocompletion usage. +	autocomplete Transformer
2021	Unified Pre-training for Program Understanding and Generation + + + + +	Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang	NAACL	Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on language generation tasks, including code summarization, generation, translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations. +	pretraining Transformer
2021	Bag-of-Words Baselines for Semantic Code Search + + + + +	Xinyu Zhang, Ji Xin, Andrew Yates, Jimmy Lin	NLP4Prog	The task of semantic code search is to retrieve code snippets from a source code corpus based on an information need expressed in natural language. The semantic gap between natural language and programming languages has for long been regarded as one of the most significant obstacles to the effectiveness of keyword-based information retrieval (IR) methods. It is a common assumption that “traditional” bag-of-words IR methods are poorly suited for semantic code search: our work empirically investigates this assumption. Specifically, we examine the effectiveness of two traditional IR methods, namely BM25 and RM3, on the CodeSearchNet Corpus, which consists of natural language queries paired with relevant code snippets. We find that the two keyword-based methods outperform several pre-BERT neural models. We also compare several code-specific data pre-processing strategies and find that specialized tokenization improves effectiveness. +	search
2021	ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback + + + + +	Mike Wu, Noah D. Goodman, Chris Piech, Chelsea Finn		High-quality computer science education is limited by the difficulty of providing instructor feedback to students at scale. While this feedback could in principle be automated, supervised approaches to predicting the correct feedback are bottlenecked by the intractability of annotating large quantities of student code. In this paper, we instead frame the problem of providing feedback as few-shot classification, where a meta-learner adapts to give feedback to student code on a new programming question from just a few examples annotated by instructors. Because data for meta-training is limited, we propose a number of amendments to the typical few-shot learning framework, including task augmentation to create synthetic tasks, and additional side information to build stronger priors about each task. These additions are combined with a transformer architecture to embed discrete sequences (e.g. code) to a prototypical representation of a feedback class label. On a suite of few-shot natural language processing tasks, we match or outperform state-of-the-art performance. Then, on a collection of student solutions to exam questions from an introductory university course, we show that our approach reaches an average precision of 88% on unseen questions, surpassing the 82% precision of teaching assistants. Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university. This is, to the best of our knowledge, the first successful deployment of a machine learning based feedback to open-ended student code. +	Transformer education
2021	A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research + + + + +	Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, Denys Poshyvanyk	TSE	An increasingly popular set of techniques adopted by software engineering (SE) researchers to automate development tasks are those rooted in the concept of Deep Learning (DL). The popularity of such techniques largely stems from their automated feature engineering capabilities, which aid in modeling software artifacts. However, due to the rapid pace at which DL techniques have been adopted, it is difficult to distill the current successes, failures, and opportunities of the current research landscape. In an effort to bring clarity to this crosscutting area of work, from its modern inception to the present, this paper presents a systematic literature review of research at the intersection of SE & DL. The review canvases work appearing in the most prominent SE and DL conferences and journals and spans 128 papers across 23 unique SE tasks. We center our analysis around the components of learning, a set of principles that govern the application of machine learning techniques (ML) to a given problem domain, discussing several aspects of the surveyed work at a granular level. The end result of our analysis is a research roadmap that both delineates the foundations of DL techniques applied to SE research, and highlights likely areas of fertile exploration for the future. +	survey
2021	An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions + + + + +	Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri		There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described `AI pair programmer’, GitHub Copilot, a language model trained over open-source GitHub code. However, code often contains bugs - and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns on the security of Copilot’s code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g. those from MITRE’s “Top 25” list). We explore Copilot’s performance on three distinct code generation axes – examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, producing 1,692 programs. Of these, we found approximately 40% to be vulnerable. +	Transformer language model
2021	A Semantic Bug Seeding: A Learning-Based Approach for Creating Realistic Bugs + + + + +	Jibesh Patra, Michael Pradel	FSE	When working on techniques to address the wide-spread problem +of software bugs, one often faces the need for a large number of +realistic bugs in real-world programs. Such bugs can either help +evaluate an approach, e.g., in form of a bug benchmark or a suite +of program mutations, or even help build the technique, e.g., in +learning-based bug detection. Because gathering a large number ofreal bugs is difficult, +a common approach is to rely on automatically +seeded bugs. Prior work seeds bugs based on syntactic transformation patterns, +which often results in unrealistic bugs and typically +cannot introduce new, application-specific code tokens. This paper +presents SemSeed, a technique for automatically seeding bugs in +a semantics-aware way. The key idea is to imitate how a given +real-world bug would look like in other programs by semantically +adapting the bug pattern to the local context. To reason about the +semantics of pieces of code, our approach builds on learned token embeddings +that encode the semantic similarities of identifiers and literals. Our +evaluation with real-world JavaScript softwares +hows that the approach effectively reproduces real bugs and clearly +outperforms a semantics-unaware approach. The seeded bugs are +useful as training data for learning-based bug detection, where +they significantly improve the bug detection ability. Moreover, we +show that SemSeed-created bugs complement existing mutation +testing operators, and that our approach is efficient enough to seed +hundreds of thousands of bugs within an hour. +	repair edit
2021	How could Neural Networks understand Programs? + + + + +	Dinglan Peng, Shuxin Zheng, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu	ICML	Semantic understanding of programs is a fundamental problem for programming language processing (PLP). Recent works that learn representations of code based on pre-training techniques in NLP have pushed the frontiers in this direction. However, the semantics of PL and NL have essential differences. These being ignored, we believe it is difficult to build a model to better understand programs, by either directly applying off-the-shelf NLP pre-training techniques to the source code, or adding features to the model by the heuristic. In fact, the semantics of a program can be rigorously defined by formal semantics in PL theory. For example, the operational semantics, describes the meaning of a valid program as updating the environment (i.e., the memory address-value function) through fundamental operations, such as memory I/O and conditional branching. Inspired by this, we propose a novel program semantics learning paradigm, that the model should learn from information composed of (1) the representations which align well with the fundamental operations in operational semantics, and (2) the information of environment transition, which is indispensable for program understanding. To validate our proposal, we present a hierarchical Transformer-based pre-training model called OSCAR to better facilitate the understanding of programs. OSCAR learns from intermediate representation (IR) and an encoded representation derived from static analysis, which are used for representing the fundamental operations and approximating the environment transitions respectively. OSCAR empirically shows the outstanding capability of program semantics understanding on many practical software engineering tasks. +	Transformer
2021	Retrieval Augmented Code Generation and Summarization + + + + +	Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang	EMNLP-Findings	Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers’ code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER has a couple of uniqueness. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework. +	Transformer summarization code generation
2021	ConTest: A Unit Test Completion Benchmark featuring Context + + + + +	Johannes Villmow, Jonas Depoix, Adrian Ulges	NLP4Prog	We introduce CONTEST, a benchmark for NLP-based unit test completion, the task of predicting a test’s assert statements given its setup and focal method, i.e. the method to be tested. ConTest is large-scale (with 365k datapoints). Besides the test code and tested code, it also features context code called by either. We found context to be crucial for accurately predicting assertions. We also introduce baselines based on transformer encoder-decoders, and study the effects of including syntactic information and context. Overall, our models achieve a BLEU score of 38.2, while only generating unparsable code in 1.92% of cases. +	benchmark dataset verification Transformer
2021	Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers + + + + +	Emanuele Parisi, Francesco Barchi, Andrea Bartolini, Giuseppe Tagliavini, Andrea Acquaviva	DATE	The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption. Depending on the kernel to be executed, the energy optimal scaling configuration is not trivial. While recent work has focused on general-purpose systems to learn and predict the best execution target in terms of the execution time of a snippet of code or kernel (e.g. offload OpenCL kernel on multicore CPU or GPU), in this work we focus on static compile-time features to assess if they can be successfully used to predict the minimum energy configuration on PULP, an ultra-low-power architecture featuring an on-chip cluster of RISC-V processors. Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation. +	optimization program analysis
2021	Learning to Describe Solutions for Bug Reports Based on Developer Discussions + + + + +	Sheena Panthaplackel, Junyi Jessy Li, Milos Gligoric, Raymond J. Mooney		When a software bug is reported, developers engage in a discussion to collaboratively resolve it. While the solution is likely formulated within the discussion, it is often buried in a large amount of text, making it difficult to comprehend, which delays its implementation. To expedite bug resolution, we propose generating a concise natural language description of the solution by synthesizing relevant content within the discussion, which encompasses both natural language and source code. Furthermore, to support generating an informative description during an ongoing discussion, we propose a secondary task of determining when sufficient context about the solution emerges in real-time. We construct a dataset for these tasks with a novel technique for obtaining noisy supervision from repository changes linked to bug reports. We establish baselines for generating solution descriptions, and develop a classifier which makes a prediction following each new utterance on whether or not the necessary context for performing generation is available. Through automated and human evaluation, we find these tasks to form an ideal testbed for complex reasoning in long, bimodal dialogue context. +	summarization documentation
2021	Understanding Neural Code Intelligence Through Program Simplification + + + + +	Md Rafiqul Islam Rabin, Vincent J. Hellendoorn, Mohammad Amin Alipour	ESEC/FSE	A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of “transparent/interpretable-AI”. However, these approaches are often specific to a particular set of network architectures, even requiring access to the network’s parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND’s extracted features may help understand neural CI systems’ predictions and learned behavior. +	interpretability refactoring information extraction
2021	On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations + + + + +	Md Rafiqul Islam Rabin, Nghi D. Q. Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, Mohammad Amin Alipour	IST	With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees. On the positive side, we observe that as the size of the training dataset grows and diversifies the generalizability of correct predictions produced by the neural program models can be improved too. Our results on the generalizability of neural program models provide insights to measure their limitations and provide a stepping stone for their improvement. +	evaluation adversarial generalizability refactoring summarization
2021	Unsupervised Learning of General-Purpose Embeddings for Code Changes + + + + +	Mikhail Pravilov, Egor Bogomolov, Yaroslav Golubev, Timofey Bryksin		Applying machine learning to tasks that operate with code changes requires their numerical representation. In this work, we propose an approach for obtaining such representations during pre-training and evaluate them on two different downstream tasks - applying changes to code and commit message generation. During pre-training, the model learns to apply the given code change in a correct way. This task requires only code changes themselves, which makes it unsupervised. In the task of applying code changes, our model outperforms baseline models by 5.9 percentage points in accuracy. As for the commit message generation, our model demonstrated the same results as supervised models trained for this specific task, which indicates that it can encode code changes well and can be improved in the future by pre-training on a larger dataset of easily gathered code changes. +	edit representation
2021	Time-Efficient Code Completion Model for the R Programming Language + + + + +	Artem Popov, Dmitrii Orekhov, Denis Litvinov, Nikolay Korolev, Gleb Morgachev	NLP4Prog	In this paper we present a deep learning code completion model for the R language. We introduce several techniques to utilize language modeling based architecture in the code completion task. With these techniques, the model requires low resources, but still achieves high quality. We also present an evaluation dataset for the R language completion task. Our dataset contains multiple autocompletion usage contexts that provides robust validation results. The dataset is publicly available. +	dataset language model code generation Transformer
2021	Leveraging Automated Unit Tests for Unsupervised Code Translation + + + + +	Baptiste Roziere, Jie M. Zhang, Francois Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample		With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java → Python and Python → C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%. +	migration
2021	DOBF: A Deobfuscation Pre-Training Objective for Programming Languages + + + + +	Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, Guillaume Lample		Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in unsupervised code translation, and 24% in natural language code search. Incidentally, we found that our pre-trained model is able to de-obfuscate fully obfuscated source files, and to suggest descriptive variable names. +	pretraining
2021	You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion + + + + +	Roei Schuster, Congzheng Song, Eran Tromer, Vitaly Shmatikov	USENIX Security	Code autocompletion is an integral feature of modern code editors and IDEs. The latest generation of autocompleters uses neural language models, trained on public open-source code repositories, to suggest likely (not just statically feasible) completions given the current context. + + We demonstrate that neural code autocompleters are vulnerable to poisoning attacks. By adding a few specially-crafted files to the autocompleter’s training corpus (data poisoning), or else by directly fine-tuning the autocompleter on these files (model poisoning), the attacker can influence its suggestions for attacker-chosen contexts. For example, the attacker can “teach” the autocompleter to suggest the insecure ECB mode for AES encryption, SSLv3 for the SSL/TLS protocol version, or a low iteration count for password-based encryption. Moreover, we show that these attacks can be targeted: an autocompleter poisoned by a targeted attack is much more likely to suggest the insecure completion for files from a specific repo or specific developer. + + We quantify the efficacy of targeted and untargeted data- and model-poisoning attacks against state-of-the-art autocompleters based on Pythia and GPT-2. We then evaluate existing defenses against poisoning attacks and show that they are largely ineffective. +	autocomplete adversarial
2021	Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks + + + + +	Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladmir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Ulrich Finkler		Advancements in deep learning and machine learning algorithms have enabled +breakthrough progress in computer vision, speech recognition, natural language +processing and beyond. In addition, over the last several decades, software has +been built into the fabric of every aspect of our society. Together, these two +trends have generated new interest in the fast-emerging research area of “AI for +Code”. As software development becomes ubiquitous across all industries and code +infrastructure of enterprise legacy applications ages, it is more critical than ever +to increase software development productivity and modernize legacy applications. +Over the last decade, datasets like ImageNet, with its large scale and diversity, +have played a pivotal role in algorithmic advancements from computer vision to +language and speech understanding. In this paper, we present “Project CodeNet”, +a first-of-its-kind, very large scale, diverse, and high-quality dataset to accelerate +the algorithmic advancements in AI for Code. It consists of 14M code samples +and about 500M lines of code in 55 different programming languages. Project +CodeNet is not only unique in its scale, but also in the diversity of coding tasks +it can help benchmark: from code similarity and classification for advances in +code recommendation algorithms, and code translation between a large variety +programming languages, to advances in code performance (both runtime, and +memory) improvement techniques. CodeNet also provides sample input and output +test sets for over 7M code samples, which can be critical for determining code +equivalence in different languages. As a usability feature, we provide several +preprocessing tools in Project CodeNet to transform source codes into representations +that can be readily used as inputs into machine learning models. +	dataset
2021	Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation + + + + +	Gabriel Orlanski, Alex Gittens	NLP4Prog	Answering a programming question with only its title is difficult as salient contextual information is left out. To address this, we present a corpus of over 40,000 StackOverflow question texts to be used in conjunction with the corresponding intents from the CoNaLa dataset (Yin et al., 2018). Using both the intent and the question body, we use BART to establish a baseline BLEU score of 34.35 for this new task. We then find further improvements of 2.8% by combining the mined CoNaLa data with the labeled data to achieve a 35.32 BLEU score. We then evaluate the prior state-of-the-art CoNaLa models with this additional data. We find that our proposed method of using the body and mined data beats that of the previous state-of-the-art by a 71.96% BLEU score. Finally, we perform ablations that prove that BART is an unsupervised multimodal learner and examine its extractive behavior. +	dataset Transformer
2021	CoTexT: Multi-task Learning with Code-Text Transformer + + + + +	Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, Yanfang Ye	NLP4Prog	We present CoTexT, a transformer-based architecture encoder-decoder pre-trained model that learns the representative context between natural language (NL) and programming language (PL) through multi-task learning. CoTexT is pre-trained, in self-supervised fashion, based on large programming language corpus to learn general-purpose understanding and code-text generation supporting downstream NL-PL task such as code summarizing/documentation, code generation, defect detection, code debugging, etc. We train CoTexT on different combination of available PL corpus including both “bimodal” and “unimodal” data where the former is the combinations of both natural texts and their corresponding code snippets in an input sequence and the latter is merely code snippets. We evaluate multi-task learning CoTexT on different generation and classification tasks on CodeXGLUE and it achieves state-of-the-art on all downstream tasks. +	Transformer
2021	DIRECT : A Transformer-based Model for Decompiled Identifier Renaming + + + + +	Vikram Nitin, Anthony Saieva, Baishakhi Ray, Gail Kaiser	NLP4Prog	Decompiling binary executables to high-level code is an important step in reverse engineering scenarios, such as malware analysis and legacy code maintenance. However, the generated high-level code is difficult to understand since the original variable names are lost. In this paper, we leverage transformer models to reconstruct the original variable names from decompiled code. Inherent differences between code and natural language present certain challenges in applying conventional transformer-based architectures to variable name recovery. We propose DIRECT, a novel transformer-based architecture customized specifically for the task at hand. We evaluate our model on a dataset of decompiled functions and find that DIRECT outperforms the previous state-of-the-art model by up to 20%. We also present ablation studies evaluating the impact of each of our modifications. We make the source code of DIRECT available to encourage reproducible research. +	Transformer decompilation
2021	Program Synthesis with Large Language Models + + + + +	Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton		This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model’s ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model’s initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input. +	Transformer synthesis
2021	Impact of Evaluation Methodologies on Code Summarization + + + + +	Pengyu Nie, Jiyang Zhang, Junyi Jessy Li, Raymond J. Mooney, Milos Gligoric	ACL	There has been a growing interest in developing machine learning (ML) models for code summarization tasks, e.g., comment generation and method naming. Despite substantial increase in the effectiveness of ML models, the evaluation methodologies, i.e., the way people split datasets into training, validation, and test sets, were not well studied. Specifically, no prior work on code summarization considered the timestamps of code and comments during evaluation. This may lead to evaluations that are inconsistent with the intended use cases. In this paper, we introduce the time-segmented evaluation methodology, which is novel to the code summarization research community, and compare it with the mixed-project and cross-project methodologies that have been commonly used. Each methodology can be mapped to some use cases, and the time-segmented methodology should be adopted in the evaluation of ML models for code summarization. To assess the impact of methodologies, we collect a dataset of (code, comment) pairs with timestamps to train and evaluate several recent ML models for code summarization. Our experiments show that different methodologies lead to conflicting evaluation results. We invite the community to expand the set of methodologies used in evaluations. +	evaluation dataset
2021	Show Your Work: Scratchpads for Intermediate Computation with Language Models + + + + +	Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena		Large pre-trained language models perform remarkably well on tasks that can be done “in one pass”, such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations – even in the few-shot regime – when asked to perform the operation “step by step”, showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a “scratchpad”. On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations. +	Transformer execution
2021	Neural Program Generation Modulo Static Analysis + + + + +	Rohan Mukherjee, Yeming Wen, Dipak Chaudhari, Thomas W. Reps, Swarat Chaudhuri, Chris Jermaine	NeurIPS	State-of-the-art neural models of source code tend to be evaluated on the generation +of individual expressions and lines of code, and commonly fail on long-horizon +tasks such as the generation of entire method bodies. We propose to address this +deficiency using weak supervision from a static program analyzer. Our neurosymbolic method allows a deep generative model to symbolically compute, using calls +to a static-analysis tool, long-distance semantic relationships in the code that it +has already generated. During training, the model observes these relationships +and learns to generate programs conditioned on them. We apply our approach to +the problem of generating entire Java methods given the remainder of the class +that contains the method. Our experiments show that the approach substantially +outperforms state-of-the-art transformers and a model that explicitly tries to learn +program semantics on this task, both in terms of producing programs free of basic +semantic errors and in terms of syntactically matching the ground truth. +	synthesis language model
2021	Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size + + + + +	Martin Monperrus, Matias Martinez, He Ye, Fernanda Madeiral, Thomas Durieux, Zhongxing Yu		This paper presents Megadiff, a dataset of source code diffs. It focuses on Java, with strict inclusion criteria based on commit message and diff size. Megadiff contains 663 029 Java diffs that can be used for research on commit comprehension, fault localization, automated program repair, and machine learning on code changes. +	dataset edit
2021	ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference + + + + +	Amir M. Mir, Evaldas Latoskinas, Georgios Gousios	MSR	In this paper, we present ManyTypes4Py, a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5,382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of the duplication bias. To facilitate training and evaluation of ML models, the dataset was split into training, validation and test sets by files. To extract type information from abstract syntax trees (ASTs), a lightweight static analyzer pipeline is developed and accompanied with the dataset. Using this pipeline, the collected Python projects were analyzed and the results of the AST analysis were stored in JSON-formatted files. The ManyTypes4Py dataset is shared on zenodo and its tools are publicly available on GitHub. +	dataset types
2021	Type4Py: Deep Similarity Learning-Based Type Inference for Python + + + + +	Amir M. Mir, Evaldas Latoskinas, Sebastian Proksch, Georgios Gousios		Dynamic languages, such as Python and Javascript, trade static typing for developer flexibility. While this allegedly enables greater productivity, lack of static typing can cause runtime exceptions, type inconsistencies, and is a major factor for weak IDE support. To alleviate these issues, PEP 484 introduced optional type annotations for Python. As retrofitting types to existing codebases is error-prone and laborious, learning-based approaches have been proposed to enable automatic type annotations based on existing, partially annotated codebases. However, the prediction of rare and user-defined types is still challenging. In this paper, we present Type4Py, a deep similarity learning-based type inference model for Python. We design a hierarchical neural network model that learns to discriminate between types of the same kind and dissimilar types in a high-dimensional space, which results in clusters of types. Nearest neighbor search suggests likely type signatures of given Python functions. The types visible to analyzed modules are surfaced using lightweight dependency analysis. The results of quantitative and qualitative evaluation indicate that Type4Py significantly outperforms state-of-the-art approaches at the type prediction task. Considering the Top-1 prediction, Type4Py obtains 19.33% and 13.49% higher precision than Typilus and TypeWriter, respectively, while utilizing a much bigger vocabulary. +	types
2020	Improved Automatic Summarization of Subroutines via Attention to File Context + + + + +	Sakib Haque, Alexander LeClair, Lingfei Wu, Collin McMillan		Software documentation largely consists of short, natural language summaries of the subroutines in the software. These summaries help programmers quickly understand what a subroutine does without having to read the source code him or herself. The task of writing these descriptions is called “source code summarization” and has been a target of research for several years. Recently, AI-based approaches have superseded older, heuristic-based approaches. Yet, to date these AI-based approaches assume that all the content needed to predict summaries is inside subroutine itself. This assumption limits performance because many subroutines cannot be understood without surrounding context. In this paper, we present an approach that models the file context of subroutines (i.e. other subroutines in the same file) and uses an attention mechanism to find words and concepts to use in summaries. We show in an experiment that our approach extends and improves several recent baselines. +	summarization
2020	A Multi-Perspective Architecture for Semantic Code Search + + + + +	Rajarshi Haldar, Lingfei Wu, Jinjun Xiong, Julia Hockenmaier	ACL	The ability to match pieces of code to their corresponding natural language descriptions and vice versa is fundamental for natural language search interfaces to software repositories. In this paper, we propose a novel multi-perspective cross-lingual neural framework for code–text matching, inspired in part by a previous model for monolingual text-to-text matching, to capture both global and local similarities. Our experiments on the CoNaLa dataset show that our proposed model yields better performance on this cross-lingual text-to-code matching task than previous approaches that map code and text to a single joint embedding space. +	search
2020	Fast and Memory-Efficient Neural Code Completion + + + + +	Alexey Svyatkovskiy, Sebastian Lee, Anna Hadjitofi, Maik Riechert, Juliana Franco, Miltiadis Allamanis		Code completion is one of the most widely used features of modern integrated development environments (IDEs). Deep learning has recently made significant progress in the statistical prediction of source code. However, state-of-the-art neural network models consume prohibitively large amounts of memory, causing computational burden to the development environment, especially when deployed in lightweight client devices. + + In this work, we reframe neural code completion from a generation task to a task of learning to rank the valid completion suggestions computed from static analyses. By doing so, we are able to design and test a variety of deep neural network model configurations. One of our best models consumes 6 MB of RAM, computes a single suggestion in 8 ms, and achieves 90% recall in its top five suggestions. Our models outperform standard language modeling code completion techniques in terms of predictive performance, computational speed, and memory efficiency. Furthermore, they learn about code semantics from the natural language aspects of the code (e.g. identifier names) and can generalize better to previously unseen code. +	autocomplete
2020	GraphCodeBERT: Pre-training Code Representations with Data Flow + + + + +	Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Jian Yin, Daxin Jiang, Ming Zhou		Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of “where-the-value-comes-from” between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search. +	pretraining
2020	Code to Comment "Translation": Data, Metrics, Baselining & Evaluation + + + + +	David Gros, Hariharan Sezhiyan, Premkumar Devanbu, Zhou Yu		The relationship of comments to code, and in particular, the task of generating useful comments given the code, has long been of interest. The earliest approaches have been based on strong syntactic theories of comment-structures, and relied on textual templates. More recently, researchers have applied deep learning methods to this task, and specifically, trainable generative translation models which are known to work very well for Natural Language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, and so similar models and evaluation metrics could be used. We analyze several recent code-comment datasets for this task: CodeNN, DeepCom, FunCom, and DocString. We compare them with WMT19, a standard dataset frequently used to train state of the art natural language translators. We found some interesting differences between the code-comment data and the WMT19 natural language data. Next, we describe and conduct some studies to calibrate BLEU (which is commonly used as a measure of comment quality). using “affinity pairs” of methods, from different projects, in the same project, in the same class, etc; Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area. +	bimodal documentation
2020	Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair + + + + +	Haoye Tian, Kui Liu, Abdoul Kader Kaboreé, Anil Koyuncu, Li Li, Jacques Klein, Tegawendé F. Bissyandé		A large body of the literature of automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explore research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations to learn deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1000 labeled patches. Our study shows that learned representations can lead to reasonable performance when comparing against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature. +	repair Transformer
2020	IntelliCode Compose: Code Generation Using Transformer + + + + +	Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, Neel Sundaresan		In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments. +In this paper, we introduce IntelliCode Compose − a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook. +Our best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for Python programming language. +	autocomplete code generation synthesis language model pretraining
2020	CodeBERT: A Pre-Trained Model for Programming and Natural Languages + + + + +	Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou		We present CodeBERT, a bimodal pre-trained model for programming language (PL) and nat-ural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language codesearch, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing. +	pretraining
2020	On-the-Fly Adaptation of Source Code Models using Meta-Learning + + + + +	Disha Shrivastava, Hugo Larochelle, Daniel Tarlow		The ability to adapt to unseen, local contexts is an important challenge that successful models of source code must overcome. One of the most popular approaches for the adaptation of such models is dynamic evaluation. With dynamic evaluation, when running a model on an unseen file, the model is updated immediately after having observed each token in that file. In this work, we propose instead to frame the problem of context adaptation as a meta-learning problem. We aim to train a base source code model that is best able to learn from information in a file to deliver improved predictions of missing tokens. Unlike dynamic evaluation, this formulation allows us to select more targeted information (support tokens) for adaptation, that is both before and after a target hole in a file. We consider an evaluation setting that we call line-level maintenance, designed to reflect the downstream task of code auto-completion in an IDE. Leveraging recent developments in meta-learning such as first-order MAML and Reptile, we demonstrate improved performance in experiments on a large scale Java GitHub corpus, compared to other adaptation baselines including dynamic evaluation. Moreover, our analysis shows that, compared to a non-adaptive baseline, our approach improves performance on identifiers and literals by 44% and 15%, respectively. +	language model autocomplete
2020	Neural Software Analysis + + + + +	Michael Pradel, Satish Chandra		Many software development problems can be addressed by program analysis tools, which traditionally are based on precise, logical reasoning and heuristics to ensure that the tools are practical. Recent work has shown tremendous success through an alternative way of creating developer tools, which we call neural software analysis. The key idea is to train a neural machine learning model on numerous code examples, which, once trained, makes predictions about previously unseen code. In contrast to traditional program analysis, neural software analysis naturally handles fuzzy information, such as coding conventions and natural language embedded in code, without relying on manually encoded heuristics. This article gives an overview of neural software analysis, discusses when to (not) use it, and presents three example analyses. The analyses address challenging software development problems: bug detection, type prediction, and code completion. The resulting tools complement and outperform traditional program analyses, and are used in industrial practice. +	program analysis survey
2020	NaturalCC: A Toolkit to Naturalize the Source Code Corpus + + + + +	Yao Wan, Yang He, Jian-Guo Zhang, Yulei Sui, Hai Jin, Guandong Xu, Caiming Xiong, Philip S. Yu		We present NaturalCC, an efficient and extensible toolkit to bridge the gap between natural language and programming language, and facilitate the research on big code analysis. Using NaturalCC, researchers both from natural language or programming language communities can quickly and easily reproduce the state-of-the-art baselines and implement their approach. NaturalCC is built upon Fairseq and PyTorch, providing (1) an efficient computation with multi-GPU and mixed-precision data processing for fast model training, (2) a modular and extensible framework that makes it easy to reproduce or implement an approach for big code analysis, and (3) a command line interface and a graphical user interface to demonstrate each model’s performance. Currently, we have included several state-of-the-art baselines across different tasks (e.g., code completion, code comment generation, and code retrieval) for demonstration. The video of this demo is available at https://www.youtube.com/watch?v=q4W5VSI-u3E&t=25s. +	documentation search summarization
2020	Hoppity: Learning Bug Detection and Repair + + + + +	Elizabeth Dinella, Hanjun Dai, Ziyang Li, Mayur Naik, Le Song, Ke Wang	ICLR	We present a learning-based approach to detect and fix a broad range of bugs in Javascript programs. We frame the problem in terms of learning a sequence of graph transformations: given a buggy program modeled by a graph structure, our model makes a sequence of predictions including the position of bug nodes and corresponding graph edits to produce a fix. Unlike previous works that use deep neural networks, our approach targets bugs that are more complex and semantic in nature (i.e.~bugs that require adding or deleting statements to fix). We have realized our approach in a tool called HOPPITY. By training on 338,877 Javascript code change commits on Github, HOPPITY correctly detects and fixes bugs in 9,612 out of 42,365 programs in an end-to-end fashion. Given the bug location and type of the fix, HOPPITY also outperforms the baseline approach by a wide margin. +	edit repair
2020	CoNCRA: A Convolutional Neural Network Code Retrieval Approach + + + + +	Marcelo de Rezende Martins, Marco Aurélio Gerosa	SBES '20	Software developers routinely search for code using general-purpose search engines. However, these search engines cannot find code semantically unless it has an accompanying description. We propose a technique for semantic code search: A Convolutional Neural Network approach to code retrieval (CoNCRA). Our technique aims to find the code snippet that most closely matches the developer’s intent, expressed in natural language. We evaluated our approach’s efficacy on a dataset composed of questions and code snippets collected from Stack Overflow. Our preliminary results showed that our technique, which prioritizes local interactions (words nearby), improved the state-of-the-art (SOTA) by 5% on average, retrieving the most relevant code snippets in the top 3 (three) positions by almost 80% of the time. Therefore, our technique is promising and can improve the efficacy of semantic code retrieval. + +	search
2020	Deep Learning & Software Engineering: State of Research and Future Directions + + + + +	Prem Devanbu, Matthew Dwyer, Sebastian Elbaum, Michael Lowry, Kevin Moran, Denys Poshyvanyk, Baishakhi Ray, Rishabh Singh, Xiangyu Zhang		Given the current transformative potential of research that sits at the intersection of Deep Learning (DL) and Software Engineering (SE), an NSF-sponsored community workshop was conducted in co-location with the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE’19) in San Diego, California. The goal of this workshop was to outline high priority areas for cross-cutting research. While a multitude of exciting directions for future work were identified, this report provides a general summary of the research areas representing the areas of highest priority which were discussed at the workshop. The intent of this report is to serve as a potential roadmap to guide future work that sits at the intersection of SE & DL. +	survey
2020	ProGraML: Graph-based Deep Learning for Program Optimization and Analysis + + + + +	Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Hugh Leather		The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in two key areas: a representation that accurately captures the semantics of programs, and a model architecture with sufficient expressiveness to reason about this representation. + + We introduce ProGraML - Program Graphs for Machine Learning - a novel graph-based program representation using a low level, language agnostic, and portable format; and machine learning models capable of performing complex downstream tasks over these graphs. The ProGraML representation is a directed attributed multigraph that captures control, data, and call relations, and summarizes instruction and operand types and ordering. Message Passing Neural Networks propagate information through this structured representation, enabling whole-program or per-vertex classification tasks. + + ProGraML provides a general-purpose program representation that equips learnable models to perform the types of program analysis that are fundamental to optimization. To this end, we evaluate the performance of our approach first on a suite of traditional compiler analysis tasks: control flow reachability, dominator trees, data dependencies, variable liveness, and common subexpression detection. On a benchmark dataset of 250k LLVM-IR files covering six source programming languages, ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches. We then apply our approach to two high-level tasks - heterogeneous device mapping and program classification - setting new state-of-the-art performance in both. +	dataset GNN
2020	Code and Named Entity Recognition in StackOverflow + + + + +	Jeniya Tabassum, Mounica Maddela, Wei Xu, Alan Ritter	ACL	There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which lead to an absolute increase of +10 F-1 score over off-the-shelf BERT. We also present the SoftNER model which achieves an overall 79.10 F1 score for code and named entity recognition on StackOverflow data. Our SoftNER model incorporates a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model. +	dataset information extraction
2020	Incorporating External Knowledge through Pre-training for Natural Language to Code Generation + + + + +	Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, Graham Neubig	ACL	Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at [Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at https://github.com/neulab/external-knowledge-codegen. +	bimodal code generation
2020	Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer + + + + +	Suyoung Lee, HyungSeok Han, Sang Kil Cha, Sooel Son	USENIX	JavaScript (JS) engine vulnerabilities pose significant security threats affecting billions of web browsers. While fuzzing is a prevalent technique for finding such vulnerabilities, there have been few studies that leverage the recent advances in neural network language models (NNLMs). In this paper, we present Montage, the first NNLM-guided fuzzer for finding JS engine vulnerabilities. The key aspect of our technique is to transform a JS abstract syntax tree (AST) into a sequence of AST subtrees that can directly train prevailing NNLMs. We demonstrate that Montage is capable of generating valid JS tests, and show that it outperforms previous studies in terms of finding vulnerabilities. Montage found 37 real-world bugs, including three CVEs, in the latest JS engines, demonstrating its efficacy in finding JS engine bugs. +	fuzzing language model
2020	Improved Code Summarization via a Graph Neural Network + + + + +	Alexander LeClair, Sakib Haque, Lingfei Wu, Collin McMillan		Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source code as input and outputs a natural language description. Yet a strong consensus is developing that using structural information as input leads to improved performance. The first approaches to use structural information flattened the AST into a sequence. Recently, more complex approaches based on random AST paths or graph neural networks have improved on the models using flattened ASTs. However, the literature still does not describe the using a graph neural network together with source code sequence as separate inputs to a model. Therefore, in this paper, we present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries. We evaluate our technique using a data set of 2.1 million Java method-comment pairs and show improvement over four baseline techniques, two from the software engineering literature, and two from machine learning literature. +	summarization
2020	Learning to Update Natural Language Comments Based on Code Changes + + + + +	Sheena Panthaplackel, Pengyu Nie, Milos Gligoric, Raymond J. Mooney, Junyi Jessy Li	ACL	We formulate the novel task of automatically updating an existing natural language comment based on changes in the body of code it accompanies. We propose an approach that learns to correlate changes across two distinct language representations, to generate a sequence of edits that are applied to the existing comment to reflect the source code modifications. We train and evaluate our model using a dataset that we collected from commit histories of open-source software projects, with each example consisting of a concurrent update to a method and its corresponding comment. We compare our approach against multiple baselines using both automatic metrics and human evaluation. Results reflect the challenge of this task and that our model outperforms baselines with respect to making edits. +	bimodal edit documentation
2020	Unsupervised Translation of Programming Languages + + + + +	Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, Guillaume Lample		A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is timeconsuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin. +	migration
2020	Recommendation of Move Method Refactoring Using Path-Based Representation of Code + + + + +	Zarina Kurbatova, Ivan Veselov, Yaroslav Golubev, Timofey Bryksin		Software refactoring plays an important role in increasing code quality. One of the most popular refactoring types is the Move Method refactoring. It is usually applied when a method depends more on members of other classes than on its own original class. Several approaches have been proposed to recommend Move Method refactoring automatically. Most of them are based on heuristics and have certain limitations (e.g., they depend on the selection of metrics and manually-defined thresholds). In this paper, we propose an approach to recommend Move Method refactoring based on a path-based representation of code called code2vec that is able to capture the syntactic structure and semantic information of a code fragment. We use this code representation to train a machine learning classifier suggesting to move methods to more appropriate classes. We evaluate the approach on two publicly available datasets: a manually compiled dataset of well-known open-source projects and a synthetic dataset with automatically injected code smell instances. The results show that our approach is capable of recommending accurate refactoring opportunities and outperforms JDeodorant and JMove, which are state of the art tools in this field. +	refactoring
2020	Code Prediction by Feeding Trees to Transformers + + + + +	Seohyun Kim, Jinman Zhao, Yuchi Tian, Satish Chandra		In this paper, we describe how to leverage Transformer, a recent neural architecture for learning from sequential data (such as text), for code completion. As in the realm of natural language processing, Transformers surpass the prediction accuracy achievable by RNNs; we provide an experimental confirmation of this over a Python dataset. + + Furthermore, we show that the way to obtain even better accuracy from Transformers is to expose the syntactic structure of code, which is easily recovered by parsing, to the neural network. This works significantly better than presenting the code as a linear token sequence, which is how Transformers were originally intended to be used. + + To accomplish this, we propose a novel enhancement to the self-attention mechanism of the Transformer. We enable the mechanism to learn weights—that is, how much to focus on each preceding token in the input—not only on the basis of a token’s value, but also on the basis of the spatial relationships, as in their positions in the abstract syntax tree, between each pair of tokens. + + We provide comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a Python corpus internal to Facebook. +	autocomplete
2020	PSCS: A Path-based Neural Model for Semantic Code Search + + + + +	Zhensu Sun, Yan Liu, Chen Yang, Yu Qian		To obtain code snippets for reuse, programmers prefer to search for related documents, e.g., blogs or Q&A, instead of code itself. The major reason is due to the semantic diversity and mismatch between queries and code snippets. Deep learning models have been proposed to address this challenge. Compared with approaches using information retrieval techniques, deep learning models do not suffer from the information loss caused by refining user intention into keywords. However, the performance of previous works is not satisfactory because they ignore the importance of code structure. When the semantics of code (e.g., identifier names, APIs) are ambiguous, code structure may be the only feature for the model to utilize. In that case, previous works relearn the structural information from lexical tokens of code, which is extremely difficult for a model without any domain knowledge. In this work, we propose PSCS, a path-based neural model for semantic code search. Our model encodes both the semantics and structures of code represented by AST paths. We train and evaluate our model over 330k-19k query-function pairs, respectively. The evaluation results demonstrate that PSCS achieves a SuccessRate of 47.6% and a Mean Reciprocal Rank (MRR) of 30.4% when considering the top-10 results with a match. The proposed approach significantly outperforms both DeepCS, the first approach that applies deep learning to code search task, and CARLCS, a state-of-the-art approach that introduces a co-attentive representation learning model on the basis of DeepCS. The importance of code structure is demonstrated with an ablation study on code features, which enlightens model design for further studies. +	grammar search
2020	Deep Just-In-Time Inconsistency Detection Between Comments and Source Code + + + + +	Sheena Panthaplackel, Junyi Jessy Li, Milos Gligoric, Raymond J. Mooney		Natural language comments convey key aspects of source code such as implementation, usage, and pre- and post-conditions. Failure to update comments accordingly when the corresponding code is modified introduces inconsistencies, which is known to lead to confusion and software bugs. In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code, in order to catch potential inconsistencies just-in-time, i.e., before they are committed to a version control system. To achieve this, we develop a deep-learning approach that learns to correlate a comment with code changes. By evaluating on a large corpus of comment/code pairs spanning various comment types, we show that our model outperforms multiple baselines by significant margins. For extrinsic evaluation, we show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system which can both detect and resolve inconsistent comments based on code changes. +	edit bimodal documentation
2020	SCELMo: Source Code Embeddings from Language Models + + + + +	Rafael-Michael Karampatsis, Charles Sutton		Continuous embeddings of tokens in computer programs have been used to support a variety of software development tools, including readability, code search, and program repair. Contextual embeddings are common in natural language processing but have not been previously applied in software engineering. We introduce a new set of deep contextualized word representations for computer programs based on language models. We train a set of embeddings using the ELMo (embeddings from language models) framework of Peters et al (2018). We investigate whether these embeddings are effective when fine-tuned for the downstream task of bug detection. We show that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection. +	pretraining defect
2020	Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code + + + + +	Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes Charles Sutton, Andrea Janes	ICSE	Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. +	language model
2020	Pre-trained Contextual Embedding of Source Code + + + + +	Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi	ICML	The source code of a program not only serves as a formal description of an executable task, but it also serves to communicate developer intent in a human-readable form. To facilitate this, developers use meaningful identifier names and natural-language documentation. This makes it possible to successfully apply sequence-modeling approaches, shown to be effective in natural-language processing, to source code. A major advancement in natural-language understanding has been the use of pre-trained token embeddings; BERT and other works have further shown that pre-trained contextual embeddings can be extremely powerful and can be fine-tuned effectively for a variety of downstream supervised tasks. Inspired by these developments, we present the first attempt to replicate this success on source code. We curate a massive corpus of Python programs from GitHub to pre-train a BERT model, which we call Code Understanding BERT (CuBERT). We also pre-train Word2Vec embeddings on the same dataset. We create a benchmark of five classification tasks and compare fine-tuned CuBERT against sequence models trained with and without the Word2Vec embeddings. Our results show that CuBERT outperforms the baseline methods by a margin of 2.9-22%. We also show its superiority when fine-tuned with smaller datasets, and over fewer epochs. We further evaluate CuBERT’s effectiveness on a joint classification, localization and repair task involving prediction of two pointers. +	pretraining
2020	Learning Graph Structure With A Finite-State Automaton Layer + + + + +	Daniel D. Johnson, Hugo Larochelle, Daniel Tarlow		Graph-based neural network models are producing strong results in a number of domains, in part because graphs provide flexibility to encode domain knowledge in the form of relational structure (edges) between nodes in the graph. In practice, edges are used both to represent intrinsic structure (e.g., abstract syntax trees of programs) and more abstract relations that aid reasoning for a downstream task (e.g., results of relevant program analyses). In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. Motivated by their power in program analyses, we consider relations defined by paths on the base graph accepted by a finite-state automaton. We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automata policies on a graph-based POMDP and then training these policies using implicit differentiation. The result is a differentiable Graph Finite-State Automaton (GFSA) layer that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs. Additionally, we combine the GFSA layer with a larger graph-based model trained end-to-end on the variable misuse program understanding task, and find that using the GFSA layer leads to better performance than using hand-engineered semantic edges or other baseline methods for adding learned edge types. +	GNN program analysis
2020	Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning + + + + +	Wei Ye, Rui Xie, Jinglei Zhang, Tianxiang Hu, Xiaoyin Wang, Shikun Zhang	WWW	Code summarization generates brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query. Since both tasks aim to model the association between natural language and programming language, recent studies have combined these two tasks to improve their performance. However, researchers have yet been able to effectively leverage the intrinsic connection between the two tasks as they train these tasks in a separate or pipeline manner, which means their performance can not be well balanced. In this paper, we propose a novel end-to-end model for the two tasks by introducing an additional code generation task. More specifically, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning, and utilize the two encoders for code summarization and code generation to train the code retrieval task via multi-task learning. We have carried out extensive experiments on an existing dataset of SQL and Python, and results show that our model can significantly improve the results of the code retrieval task over the-state-of-art models, as well as achieve competitive performance in terms of BLEU score for the code summarization task. +	search summarization
2020	Contrastive Code Representation Learning + + + + +	Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica		Machine-aided programming tools such as type predictors and code summarizers +are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised +algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on +the raw text of programs. In particular, we design an unsupervised pretext task by +generating textually divergent copies of source functions via automated source-tosource compiler transforms that preserve semantics. We train a neural model to +identify variants of an anchor program within a large batch of negatives. To solve +this task, the network must extract program features representing the functionality, +not form, of the program. This is the first application of instance discrimination +to code representation learning to our knowledge. We pre-train models over 1.8m +unannotated JavaScript methods mined from GitHub. ContraCode pre-training +improves code summarization accuracy by 7.9% over supervised approaches and +4.8% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves +the accuracy of existing baselines. +	representation pretraining
2020	Blended, precise semantic program embeddings + + + + +	Ke Wang, Zhendong Su	PLDI	Learning neural program embeddings is key to utilizing deep neural networks in program languages research — precise and efficient program representations enable the application of deep models to a wide range of program analysis tasks. Existing approaches predominately learn to embed programs from their source code, and, as a result, they do not capture deep, precise program semantics. On the other hand, models learned from runtime information critically depend on the quality of program executions, thus leading to trained models with highly variant quality. This paper tackles these inherent weaknesses of prior approaches by introducing a new deep neural network, Liger, which learns program representations from a mixture of symbolic and concrete execution traces. We have evaluated Liger on two tasks: method name prediction and semantics classification. Results show that Liger is significantly more accurate than the state-of-the-art static model code2seq in predicting method names, and requires on average around 10x fewer executions covering nearly 4x fewer paths than the state-of-the-art dynamic model DYPRO in both tasks. Liger offers a new, interesting design point in the space of neural program embeddings and opens up this new direction for exploration. +	dynamic
2020	CC2Vec: Distributed Representations of Code Changes + + + + +	Thong Hoang, Hong Jin Kang, Julia Lawall, David Lo	ICSE	Existing work on software patches often use features specific to a single task. These works often rely on manually identified features, and human effort is required to identify these features for each task. In this work, we propose CC2Vec, a neural network model that learns a representation of code changes guided by their accompanying log messages, which represent the semantic intent of the code changes. CC2Vec models the hierarchical structure of a code change with the help of the attention mechanism and uses multiple comparison functions to identify the differences between the removed and added code. + + To evaluate if CC2Vec can produce a distributed representation of code changes that is general and useful for multiple tasks on software patches, we use the vectors produced by CC2Vec for three tasks: log message generation, bug fixing patch identification, and just-in-time defect prediction. In all tasks, the models using CC2Vec outperform the state-of-the-art techniques. +	edit
2020	Semantic Scaffolds for Pseudocode-to-Code Generation + + + + +	Ruiqi Zhong, Mitchell Stern, Dan Klein		We propose a method for program generation based on semantic scaffolds, lightweight structures representing the high-level semantic and syntactic composition of a program. By first searching over plausible scaffolds then using these as constraints for a beam search over programs, we achieve better coverage of the search space when compared with existing techniques. We apply our hierarchical search method to the SPoC dataset for pseudocode-to-code generation, in which we are given line-level natural language pseudocode annotations and aim to produce a program satisfying execution-based test cases. By using semantic scaffolds during inference, we achieve a 10% absolute improvement in top-100 accuracy over the previous state-of-the-art. Additionally, we require only 11 candidates to reach the top-3000 performance of the previous best approach when tested against unseen problems, demonstrating a substantial improvement in efficiency. +	code generation synthesis
2020	Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent + + + + +	Geert Heyman, Tom Van Cutsem		In this work, we propose and study annotated code search: the retrieval of code snippets paired with brief descriptions of their intent using natural language queries. On three benchmark datasets, we investigate how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets. Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description. We find that our model yields significantly more relevant search results (with absolute gains up to 20.6% in mean reciprocal rank) compared to state-of-the-art code retrieval methods that do not use descriptions but attempt to compute the intent of snippets solely from unannotated code. +	search
2020	Global Relational Models of Source Code + + + + +	Vincent J. Hellendoorn, Charles Sutton, Rishab Singh, Petros Maniatis, David Bieber	ICLR	Models of code can learn distributed representations of a program’s syntax and semantics to predict many non-trivial properties of a program. Recent state-of-the-art models leverage highly structured representations of programs, such as trees, graphs and paths therein (e.g. data-flow relations), which are precise and abundantly available for code. This provides a strong inductive bias towards semantically meaningful relations, yielding more generalizable representations than classical sequence-based models. Unfortunately, these models primarily rely on graph-based message passing to represent relations in code, which makes them de facto local due to the high cost of message-passing steps, quite in contrast to modern, global sequence-based models, such as the Transformer. In this work, we bridge this divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers (GREAT for short), which bias traditional Transformers with relational information from graph edge types. By studying a popular, non-trivial program repair task, variable-misuse identification, we explore the relative merits of traditional and hybrid model families for code representation. Starting with a graph-based model that already improves upon the prior state-of-the-art for this task by 20%, we show that our proposed hybrid models improve an additional 10-15%, while training both faster and using fewer parameters. +	variable misuse defect GNN Transformer
2020	Improving Code Search with Co-Attentive Representation Learning + + + + +	Jianhang Shuai, Ling Xu, Chao Liu, Meng Yan, Xin Xia, Yan Lei	ICPC	Searching and reusing existing code from a large-scale codebase, e.g, GitHub, can help developers complete a programming task efficiently. Recently, Gu et al. proposed a deep learning-based model (i.e., DeepCS), which significantly outperformed prior models. The DeepCS embedded codebase and natural language queries into vectors by two LSTM (long and short-term memory) models separately, and returned developers the code with higher similarity to a code search query. However, such embedding method learned two isolated representations for code and query but ignored their internal semantic correlations. As a result, the learned isolated representations of code and query may limit the effectiveness of code search. + + To address the aforementioned issue, we propose a co-attentive representation learning model, i.e., Co-Attentive Representation Learning Code Search-CNN (CARLCS-CNN). CARLCS-CNN learns interdependent representations for the embedded code and query with a co-attention mechanism. Generally, such mechanism learns a correlation matrix between embedded code and query, and co-attends their semantic relationship via row/column-wise max-pooling. In this way, the semantic correlation between code and query can directly affect their individual representations. We evaluate the effectiveness of CARLCS-CNN on Gu et al.’s dataset with 10k queries. Experimental results show that the proposed CARLCS-CNN model significantly outperforms DeepCS by 26.72% in terms of MRR (mean reciprocal rank). Additionally, CARLCS-CNN is five times faster than DeepCS in model training and four times in testing. +	search
2020	Adaptive Deep Code Search + + + + +	Chunyang Ling, Zeqi Lin, Yanzhen Zou, Bing Xie	ICPC	Searching code in a large-scale codebase using natural language queries is a common practice during software development. Deep learning-based code search methods demonstrate superior performance if models are trained with large amount of text-code pairs. However, few deep code search models can be easily transferred from one codebase to another. It can be very costly to prepare training data for a new codebase and re-train an appropriate deep learning model. In this paper, we propose AdaCS, an adaptive deep code search method that can be trained once and transferred to new codebases. AdaCS decomposes the learning process into embedding domain-specific words and matching general syntactic patterns. Firstly, an unsupervised word embedding technique is used to construct a matching matrix to represent the lexical similarities. Then, a recurrent neural network is used to capture latent syntactic patterns from these matching matrices in a supervised way. As the supervised task learns general syntactic patterns that exist across domains, AdaCS is transferable to new codebases. Experimental results show that: when extended to new software projects never seen in the training data, AdaCS is more robust and significantly outperforms state-of-the-art deep code search methods. +	search
2020	Deep Graph Matching and Searching for Semantic Code Retrieval + + + + +	Xiang Ling, Lingfei Wu, Saizhuo Wang, Gaoning Pan, Tengfei Ma, Fangli Xu, Alex X. Liu, Chunming Wu, Shouling Ji	TKDD	Code retrieval is to find the code snippet from a large corpus of source code repositories that highly matches the query of natural language description. Recent work mainly uses natural language processing techniques to process both query texts (i.e., human natural language) and code snippets (i.e., machine programming language), however neglecting the deep structured features of query texts and source codes, both of which contain rich semantic information. In this paper, we propose an end-to-end deep graph matching and searching (DGMS) model based on graph neural networks for the task of semantic code retrieval. To this end, we first represent both natural language query texts and programming language code snippets with the unified graph-structured data, and then use the proposed graph matching and searching model to retrieve the best matching code snippet. In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them by cross-attention based semantic matching operations. We evaluate the proposed DGMS model on two public code retrieval datasets with two representative programming languages (i.e., Java and Python). Experiment results demonstrate that DGMS significantly outperforms state-of-the-art baseline models by a large margin on both datasets. Moreover, our extensive ablation studies systematically investigate and illustrate the impact of each part of DGMS. +	search GNN
2020	Static Neural Compiler Optimization via Deep Reinforcement Learning + + + + +	Rahim Mammadli, Ali Jannesari, Felix Wolf		The phase-ordering problem of modern compilers has received a lot of attention from the research community over the years, yet remains largely unsolved. Various optimization sequences exposed to the user are manually designed by compiler developers. In designing such a sequence developers have to choose the set of optimization passes, their parameters and ordering within a sequence. Resulting sequences usually fall short of achieving optimal runtime for a given source code and may sometimes even degrade the performance when compared to unoptimized version. In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem. Provided with sub-sequences constituting LLVM’s O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training and achieves competitive performance on the validation set, gaining up to 1.32x speedup on previously-unseen programs. Notably, our approach differs from autotuning methods by not depending on one or more test runs of the program for making successful optimization decisions. It has no dependence on any dynamic feature, but only on the statically-attainable intermediate representation of the source code. We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents, at first to complement, and eventually replace the hand-crafted optimization sequences. +	compilation
2020	Learning Code-Query Interaction for Enhancing Code Searches + + + + +	Wei Li, Haozhe Qin, Shuhan Yan, Beijun Shen, Yuting Chen	ICSME	Code search plays an important role in software development and maintenance. In recent years, deep learning (DL) has achieved a great success in this domain-several DL-based code search methods, such as DeepCS and UNIF, have been proposed for exploring deep, semantic correlations between code and queries; each method usually embeds source code and natural language queries into real vectors followed by computing their vector distances representing their semantic correlations. Meanwhile, deep learning-based code search still suffers from three main problems, i.e., the OOV (Out of Vocabulary) problem, the independent similarity matching problem, and the small training dataset problem. To tackle the above problems, we propose CQIL, a novel, deep learning-based code search method. CQIL learns code-query interactions and uses a CNN (Convolutional Neural Network) to compute semantic correlations between queries and code snippets. In particular, CQIL employs a hybrid representation to model code-query correlations, which solves the OOV problem. CQIL also deeply learns the code-query interaction for enhancing code searches, which solves the independent similarity matching and the small training dataset problems. We evaluate CQIL on two datasets (CODEnn and CosBench). The evaluation results show the strengths of CQIL-it achieves the MAP@1 values, 0.694 and 0.574, on CODEnn and CosBench, respectively. In particular, it outperforms DeepCS and UNIF, two state-of-the-art code search methods, by 13.6% and 18.1% in MRR, respectively, when the training dataset is insufficient. +	search
2020	DLFix: Context-based Code Transformation Learning for Automated Program Repair + + + + +	Yi Li, Shaohua Wang, Tien N. Nguyen	ICSE	Automated Program Repair (APR) is very useful in helping developers in the process of software development and maintenance. Despite recent advances in deep learning (DL), the DL-based APR approaches still have limitations in learning bug-fixing code changes and the context of the surrounding source code of the bug-fixing code changes. These limitations lead to incorrect fixing locations or fixes. In this paper, we introduce DLFix, a two-tier DL model that treats APR as code transformation learning from the prior bug fixes and the surrounding code contexts of the fixes. The first layer is a tree-based RNN model that learns the contexts of bug fixes and its result is used as an additional weighting input for the second layer designed to learn the bug-fixing code transformations. + + We conducted several experiments to evaluate DLFix in two benchmarks: Defect4J and Bugs.jar, and a newly built bug datasets with a total of +20K real-world bugs in eight projects. We compared DLFix against a total of 13 state-of-the-art pattern-based APR tools. Our results show that DLFix can auto-fix more bugs than 11 of them, and is comparable and complementary to the top two pattern-based APR tools in which there are 7 and 11 unique bugs that they cannot detect, respectively, but we can. Importantly, DLFix is fully automated and data-driven, and does not require hard-coding of bug-fixing patterns as in those tools. We compared DLFix against 4 state-of-the-art deep learning based APR models. DLFix is able to fix 2.5 times more bugs than the best performing~baseline. +	edit repair grammar
2020	Where should I comment my code? A dataset and model for predicting locations that need comments + + + + +	Annie Louis, Santanu Kumar Dash, Earl T. Barr, Charles Sutton	International Conference on Software Engineering (ICSE; NIER track)	Programmers should write code comments, but not on every line +of code. We have created a machine learning model that suggests +locations where a programmer should write a code comment. We +trained it on existing commented code to learn locations that are +chosen by developers. Once trained, the model can predict locations +in new code. Our models achieved precision of 74% and recall of +13% in identifying comment-worthy locations. This first success +opens the door to future work, both in the new where-to-comment +problem and in guiding comment generation. +	bimodal documentation
2020	Automating Just-In-Time Comment Updating + + + + +	Zhongxin Liu, Xin Xia, Meng Yan, Shanping Li	ASE	Code comments are valuable for program comprehension and software maintenance, and also require maintenance with code evolution. However, when changing code, developers sometimes neglect updating the related comments, bringing in inconsistent or obsolete comments (aka., bad comments). Such comments are detrimental since they may mislead developers and lead to future bugs. Therefore, it is necessary to fix and avoid bad comments. In this work, we argue that bad comments can be reduced and even avoided by automatically performing comment updates with code changes. We refer to this task as “Just-In-Time (JIT) Comment Updating” and propose an approach named CUP (Comment UPdater) to automate this task. CUP can be used to assist developers in updating comments during code changes and can consequently help avoid the introduction of bad comments. Specifically, CUP leverages a novel neural sequence-to-sequence model to learn comment update patterns from extant code-comment co-changes and can automatically generate a new comment based on its corresponding old comment and code change. Several customized enhancements, such as a special tokenizer and a novel co-attention mechanism, are introduced in CUP by us to handle the characteristics of this task. We build a dataset with over 108K comment-code co-change samples and evaluate CUP on it. The evaluation results show that CUP outperforms an information-retrieval-based and a rule-based baselines by substantial margins, and can reduce developers’ edits required for JIT comment updating. In addition, the comments generated by our approach are identical to those updated by developers in 1612 (16.7%) test samples, 7 times more than the best-performing baseline. +	documentation
2020	TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search + + + + +	Wenhua Wang, Yuqun Zhang, Zhengran Zeng, Guandong Xu		Code summarization and code search have been widely adopted in sofwaredevelopmentandmaintenance. However, fewstudieshave explored the efcacy of unifying them. In this paper, we propose TranS^3 , a transformer-based framework to integrate code summarization with code search. Specifcally, for code summarization,TranS^3 enables an actor-critic network, where in the actor network, we encode the collected code snippets via transformer- and tree-transformer-based encoder and decode the given code snippet to generate its comment. Meanwhile, we iteratively tune the actor network via the feedback from the critic network for enhancing the quality of the generated comments. Furthermore, we import the generated comments to code search for enhancing its accuracy. To evaluatetheefectivenessof TranS^3 , we conduct a set of experimental studies and case studies where the experimental results suggest that TranS^3 can signifcantly outperform multiple state-of-the-art approaches in both code summarization and code search and the study results further strengthen the efcacy of TranS^3 from the developers’ points of view. +	search documentation
2020	OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints + + + + +	Irene Vlassi Pandi, Earl T. Barr, Andrew D. Gordon, Charles Sutton		We present a new approach to the type inference problem for dynamic languages. Our goal is to combine logical constraints, that is, deterministic information from a type system, with natural constraints, uncertain information about types from sources like identifier names. To this end, we introduce a framework for probabilistic type inference that combines logic and learning: logical constraints on the types are extracted from the program, and deep learning is applied to predict types from surface-level code properties that are statistically associated, such as variable names. The main insight of our method is to constrain the predictions from the learning procedure to respect the logical constraints, which we achieve by relaxing the logical inference problem of type prediction into a continuous optimisation problem. To evaluate the idea, we built a tool called OptTyper to predict a TypeScript declaration file for a JavaScript library. OptTyper combines a continuous interpretation of logical constraints derived by a simple program transformation and static analysis of the JavaScript code, with natural constraints obtained from a deep learning model, which learns naming conventions for types from a large codebase. We evaluate OptTyper on a data set of 5,800 open-source JavaScript projects that have type annotations in the well-known DefinitelyTyped repository. We find that combining logical and natural constraints yields a large improvement in performance over either kind of information individually, and produces 50% fewer incorrect type predictions than previous approaches. +	types bimodal
2020	Associating Natural Language Comment and Source Code Entities + + + + +	Sheena Panthaplackel, Milos Gligoric, Raymond J. Mooney, Junyi Jessy Li	AAAI	Comments are an integral part of software development; they are natural language descriptions associated with source code elements. Understanding explicit associations can be useful in improving code comprehensibility and maintaining the consistency between code and comments. As an initial step towards this larger goal, we address the task of associating entities in Javadoc comments with elements in Java source code. We propose an approach for automatically extracting supervised data using revision histories of open source projects and present a manually annotated evaluation dataset for this task. We develop a binary classifier and a sequence labeling model by crafting a rich feature set which encompasses various aspects of code, comments, and the relationships between them. Experiments show that our systems outperform several baselines learning from the proposed supervision. +	dataset bimodal
2020	Copy that! Editing Sequences by Copying Spans + + + + +	Sheena Panthaplackel, Miltiadis Allamanis, Marc Brockschmidt		Neural sequence-to-sequence models are finding increasing use in editing of documents, for example in correcting a text document or repairing source code. In this paper, we argue that common seq2seq models (with a facility to copy single tokens) are not a natural fit for such tasks, as they have to explicitly copy each unchanged token. We present an extension of seq2seq models capable of copying entire spans of the input to the output in one step, greatly reducing the number of decisions required during inference. This extension means that there are now many ways of generating the same output, which we handle by deriving a new objective for training and a variation of beam search for inference that explicitly handle this problem. + + In our experiments on a range of editing tasks of natural language and source code, we show that our new model consistently outperforms simpler baselines. +	edit
2020	CoCoGUM: Contextual Code Summarization with Multi-Relational GNN on UMLs + + + + +	Yanlin Wang, Lun Du, Ensheng Shi, Yuxuan Hu, Shi Han, Dongmei Zhang		Code summaries are short natural language (NL) descriptions of code snippets that help developers better understand and maintain source code. Due to the pivotal role of code summaries in software development and maintenance, there is a surge of works on automatic code summarization to reduce the heavy burdens of developers. However, contemporary approaches only leverage the information within the boundary of the method being summarized (i.e., local context), and ignore that using broader context could assist with code summarization. In this paper, we explore two global context information, namely intra-class and inter-class context information, and propose the model CoCoGUM: Contextual Code Summarization with Multi-Relational Graph Neural Networks on UMLs. CoCoGUM first incorporates class names as the intra-class context, which is further fed to a Transformer-based sentence embedding model to extract the class lexical embeddings. Then, relevant Unified Modeling Language (UML) class diagrams are extracted as inter-class context and we use a Multi-Relational Graph Neural Network (MR-GNN) to encode the class relational embeddings. Class lexical embeddings and class relational embeddings, together with the outputs from code token encoder and AST encoder, are passed to the decoder armed with a two-level attention mechanism to generate high-quality context-aware code summaries. We conduct extensive experiments to evaluate our approach and compare it with other automatic code summarization models. The experimental results show that CoCoGUM outperforms state-of-the-art methods. +	summarization
2020	Embedding Java Classes with code2vec: Improvements from Variable Obfuscation + + + + +	Rhys Compton, Eibe Frank, Panos Patros, Abigail Koay	MSR	Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaining the semantics of the code as much as possible. code2vec is a recently released embedding approach that uses the proxy task of method name prediction to map Java methods to feature vectors. However, experimentation with code2vec shows that it learns to rely on variable names for prediction, causing it to be easily fooled by typos or adversarial attacks. Moreover, it is only able to embed individual Java methods and cannot embed an entire collection of methods such as those present in a typical Java class, making it difficult to perform predictions at the class level (e.g., for the identification of malicious Java classes). Both shortcomings are addressed in the research presented in this paper. We investigate the effect of obfuscating variable names during the training of a code2vec model to force it to rely on the structure of the code rather than specific names and consider a simple approach to creating class-level embeddings by aggregating sets of method embeddings. Our results, obtained on a challenging new collection of source-code classification problems, indicate that obfuscating variable names produces an embedding model that is both impervious to variable naming and more accurately reflects code semantics. The datasets, models, and code are shared for further ML research on source code. +	naming adversarial
2020	Towards Demystifying Dimensions of Source Code Embeddings + + + + +	Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour	RL+SE&PL (Co-located with ESEC/FSE)	Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, there is little known about the contents of these vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with the handcrafted features. Our results suggest that the handcrafted features can perform very close to the highly-dimensional code2vec embeddings, and the information gains are more evenly distributed in the code2vec embeddings compared to the handcrafted features. We also find that the code2vec embeddings are more resilient to the removal of dimensions with low information gains than the handcrafted features. We hope our results serve a stepping stone toward principled analysis and evaluation of these code representations. +	evaluation representation naming interpretability
2020	Suggesting Comment Completions for Python using Neural Language Models + + + + +	Adelina Ciurumelea; Sebastian Proksch; Harald C. Gall	SANER	Source-code comments are an important communication medium between developers to better understand and maintain software. Current research focuses on auto-generating comments by summarizing the code. However, good comments contain additional details, like important design decisions or required trade-offs, and only developers can decide on the proper comment content. Automated summarization techniques cannot include information that does not exist in the code, therefore fully-automated approaches while helpful, will be of limited use. In our work, we propose to empower developers through a semi-automated system instead. We investigate the feasibility of using neural language models trained on a large corpus of Python documentation strings to generate completion suggestions and obtain promising results. By focusing on confident predictions, we can obtain a top-3 accuracy of over 70%, although this comes at the cost of lower suggestion frequency. Our models can be improved by leveraging context information like the signature and the full body of the method. Additionally, we are able to return good accuracy completions even for new projects, suggesting the generalizability of our approach. +	bimodal autocomplete documentation
2020	Unit Test Case Generation with Transformers + + + + +	Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, Neel Sundaresan	ICSE	Automated Unit Test Case generation has been the focus of extensive literature within the research community. Existing approaches are usually guided by the test coverage criteria, generating synthetic test cases that are often difficult to read or understand for developers. In this paper we propose AthenaTest, an approach that aims at generating unit test cases by learning from real-world, developer-written test cases. Our approach relies on a state-of-the-art sequence-to-sequence transformer model which is able to write useful test cases for a given method under test (i.e., focal method). We also introduce methods2test - the largest publicly available supervised parallel corpus of unit test case methods and corresponding focal methods in Java, which comprises 630k test cases mined from 70k open-source repositories hosted on GitHub. We use this dataset to train a transformer model to translate focal methods into the corresponding test cases. We evaluate the ability of our model in generating test cases using natural language processing as well as code-specific criteria. First, we assess the quality of the translation compared to the target test case, then we analyze properties of the test case such as syntactic correctness and number and variety of testing APIs (e.g., asserts). We execute the test cases, collect test coverage information, and compare them with test cases generated by EvoSuite and GPT-3. Finally, we survey professional developers on their preference in terms of readability, understandability, and testing effectiveness of the generated test cases. +	code generation synthesis test generation
2020	PyMT5: multi-mode translation of natural language and Python code with transformers + + + + +	Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan	EMNLP	Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation. +	bimodal code generation summarization documentation language model pretraining
2020	Empirical Study of Transformers for Source Code + + + + +	Nadezhda Chirkova, Sergey Troshin		Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i. e. follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and all consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model. +	Transformer
2020	CodeBLEU: a Method for Automatic Evaluation of Code Synthesis + + + + +	Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, Shuai Ma		Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy. +	evaluation
2020	TAG : Type Auxiliary Guiding for Code Comment Generation + + + + +	Ruichu Cai, Zhihao Liang, Boyan Xu, Zijian Li, Yuexing Hao, Yao Chen	ACL	Existing leading code comment generation approaches with the structure-to-sequence framework ignores the type information of the interpretation of the code, e.g., operator, string, etc. However, introducing the type information into the existing framework is non-trivial due to the hierarchical dependence among the type information. In order to address the issues above, we propose a Type Auxiliary Guiding encoder-decoder framework for the code comment generation task which considers the source code as an N-ary tree with type information associated with each node. Specifically, our framework is featured with a Type-associated Encoder and a Type-restricted Decoder which enables adaptive summarization of the source code. We further propose a hierarchical reinforcement learning method to resolve the training difficulties of our proposed framework. Extensive evaluations demonstrate the state-of-the-art performance of our framework with both the auto-evaluated metrics and case studies. +	bimodal documentation
2020	OffSide: Learning to Identify Mistakes in Boundary Conditions + + + + +	Jón Arnar Briem, Jordi Smit, Hendrig Sellik, Pavel Rapoport, Georgios Gousios, Maurício Aniche.	2nd Workshop on Testing for Deep Learning and Deep Learning for Testing	Mistakes in boundary conditions are the cause of many bugs in software. +These mistakes happen when, e.g., developers make use of `<` or `>` in cases +where they should have used `<=` or `>=`. Mistakes in boundary conditions +are often hard to find and manually detecting them might be very time-consuming +for developers. While researchers have been proposing techniques to cope with +mistakes in the boundaries for a long time, the automated detection of such bugs still +remains a challenge. We conjecture that, for a tool to be able to precisely identify mistakes +in boundary conditions, it should be able to capture the overall context of the source code +under analysis. In this work, we propose a deep learning model that learn mistakes in boundary +conditions and, later, is able to identifythem in unseen code snippets. We train and test a +model on over 1.5 million code snippets, with and without mistakes in different boundary conditions. +Our model shows an accuracy from 55% up to 87%. The model is also able to detect 24 out of 41 +real-world bugs;however, with a high false positive rate. The existing state-of-the-practice linter +tools are not able to detect any of the bugs. We hope this paper can pave the road towards deep +learning models that will be able to support developers in detecting mistakes in boundary conditions. +	defect
2020	ComPy-Learn: A toolbox for exploring machine learning representations for compilers + + + + +	Alexander Brauckmann, Andrés Goens, Jeronimo Castrillon	FDL	Deep Learning methods have not only shown to improve software performance in compiler heuristics, but also e.g. to improve security in vulnerability prediction or to boost developer productivity in software engineering tools. A key to the success of such methods across these use cases is the expressiveness of the representation used to abstract from the program code. Recent work has shown that different such representations have unique advantages in terms of performance. However, determining the best-performing one for a given task is often not obvious and requires empirical evaluation. +Therefore, we present ComPy-Learn, a toolbox for conveniently defining, extracting, and exploring representations of program code. With syntax-level language information from the Clang compiler frontend and low-level information from the LLVM compiler backend, the tool supports the construction of linear and graph representations and enables an efficient search for the best-performing representation and model for tasks on program code. +	representation compilation optimization GNN
2020	A Structural Model for Contextual Code Changes + + + + +	Shaked Brody, Uri Alon, Eran Yahav	OOPSLA	We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EditCompletion task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows us to model the likelihood of the edit itself, rather than learning the likelihood of the edited code. We represent an edit operation as a path in the program’s Abstract Syntax Tree (AST), originating from the source of the edit to the target of the edit. Using this representation, we present a powerful and lightweight neural model for the EditCompletion task. We conduct a thorough evaluation, comparing our approach to a variety of representation and modeling approaches that are driven by multiple strong models such as LSTMs, Transformers, and neural CRFs. Our experiments show that our model achieves 28% relative gain over state-of-the-art sequential models and 2× higher accuracy than syntactic models that learn to generate the edited code instead of modeling the edits directly. Our code, dataset, and trained models are publicly available at https://github.com/tech-srl/c3po/ . +	edit grammar autocomplete
2020	Compiler-based graph representations for deep learning models of code + + + + +	Alexander Brauckmann, Andres Goens, Sebastian Ertel, Jeronimo Castrillon	CC	In natural language processing, novel methods in deep learning, like recurrent neural networks (RNNs) on sequences of words, have been very successful. These methods have also been used recently for tasks in compiler optimization, like heterogeneous mapping of OpenCL kernels or predicting thread coarsening factors for optimal execution times. In contrast to natural languages, programming languages usually have a well-defined structure. This structure is what enables compilers to reason about programs on the foundations of graphs, such as abstract syntax trees (ASTs) or control-data flow graphs (CDFGs). +In this paper, we argue that we should use these graph structures instead of word sequences for learning compiler optimization tasks. To this end we apply recently proposed graph neural networks (GNNs) for learning predictive compiler tasks on two representations based on ASTs and CDFGs. Experimental results show how these representations improve upon the accuracy of the state-of-the-art in the task of heterogeneous OpenCL mapping, while providing orders of magnitude faster inference times, which are crucial for compiler optimizations. When testing on benchmark suites not included for training, our graph-based methods significantly outperform the state-of-the art by 12 percentage points in terms of accuracy, and are the only ones to perform better than a random mapping. When testing on the task of predicting thread coarsening factors, we expose current limitations of deep learning in compilers. We show how all of the deep learning approaches proposed so far, including our graph-based models, fail to produce an overall speedup with their predictions. +	representation compilation optimization GNN
2020	Adversarial Robustness for Code + + + + +	Pavol Bielik, Martin Vechev		We propose a novel technique which addresses the challenge of learning accurate and robust models of code in a principled way. Our method consists of three key components: (i) learning to abstain from making a prediction if uncertain, (ii) adversarial training, and (iii) representation refinement which learns the program parts relevant for the prediction and abstracts the rest. These components are used to iteratively train multiple models, each of which learns a suitable program representation necessary to make robust predictions on a different subset of the dataset. We instantiated our approach to the task of type inference for dynamically typed languages and demonstrate its effectiveness by learning a model that achieves 88% accuracy and 84% robustness. Further, our evaluation shows that using the combination of all three components is key to obtaining accurate and robust models. +	adversarial types
2020	Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks + + + + +	David Bieber, Charles Sutton, Hugo Larochelle, Daniel Tarlow	NeurIPS	Graph neural networks (GNNs) have emerged as a powerful tool for learning software engineering tasks including code completion, bug finding, and program repair. They benefit from leveraging program structure like control flow graphs, but they are not well-suited to tasks like program execution that require far more sequential reasoning steps than number of GNN propagation steps. Recurrent neural networks (RNNs), on the other hand, are well-suited to long sequential chains of reasoning, but they do not naturally incorporate program structure and generally perform worse on the above tasks. Our aim is to achieve the best of both worlds, and we do so by introducing a novel GNN architecture, the Instruction Pointer Attention Graph Neural Networks (IPA-GNN), which achieves improved systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by considering RNNs operating on program traces with branch decisions as latent variables. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution. To test the models, we propose evaluating systematic generalization on learning to execute using control flow graphs, which tests sequential reasoning and use of program structure. More practically, we evaluate these models on the task of learning to execute partial programs, as might arise if using the model as a heuristic function in program synthesis. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both tasks. +	representation dynamic
2020	SinkFinder: harvesting hundreds of unknown interesting function pairs with just one seed + + + + +	Pan Bian, Bin Liang, Jianjun Huang, Wenchang Shi, Xidong Wang, Jian Zhang	FSE	Mastering the knowledge about security-sensitive functions that can potentially result in bugs is valuable to detect them. However, identifying this kind of functions is not a trivial task. Introducing machine learning-based techniques to do the task is a natural choice. Unfortunately, the approach also requires considerable prior knowledge, e.g., sufficient labelled training samples. In practice, the requirement is often hard to meet. + + In this paper, to solve the problem, we propose a novel and practical method called SinkFinder to automatically discover function pairs that we are interested in, which only requires very limited prior knowledge. SinkFinder first takes just one pair of well-known interesting functions as the initial seed to infer enough positive and negative training samples by means of sub-word word embedding. By using these samples, a support vector machine classifier is trained to identify more interesting function pairs. Finally, checkers equipped with the obtained knowledge can be easily developed to detect bugs in target systems. The experiments demonstrate that SinkFinder can successfully discover hundreds of interesting functions and detect dozens of previously unknown bugs from large-scale systems, such as Linux, OpenSSL and PostgreSQL. +	program analysis
2020	OCoR: An Overlapping-Aware Code Retriever + + + + +	Qihao Zhu, Zeyu Sun, Xiran Liang, Yingfei Xiong, Lu Zhang	ASE	Code retrieval helps developers reuse the code snippet in the open-source projects. Given a natural language description, code retrieval aims to search for the most relevant code among a set of code. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. The overlaps between different names used by different people indicate that two different names may be potentially related (e.g., “message” and “msg”), and the overlaps between identifiers in code and words in natural language descriptions indicate that the code snippet and the description may potentially be related. To address these problems, we propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps: the first embeds identifiers by character to capture the overlaps between identifiers, and the second introduces a novel overlap matrix to represent the degrees of overlaps between each natural language word and each identifier. +The evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches and achieves 13.1% to 22.3% improvements. Moreover, we also conducted several in-depth experiments to help understand the performance of different components in OCoR. +	search
2020	MISIM: An End-to-End Neural Code Similarity System + + + + +	Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Nesime Tatbul, Jesmin Jahan Tithi, Paul Petersen, Timothy Mattson, Tim Kraska, Pradeep Dubey, Vivek Sarkar, Justin Gottschlich		Code similarity systems are integral to a range of applications from code recommendation to automated construction of software tests and defect mitigation. In this paper, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters. We compare MISIM to three other state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 45,780 programs, MISIM consistently outperformed all three systems, often by a large factor (upwards of 40.6x). +	code similarity
2020	Graph-based, Self-Supervised Program Repair from Diagnostic Feedback + + + + +	Michihiro Yasunaga, Percy Liang		We consider the problem of learning to repair programs from diagnostic feedback (e.g., compiler error messages). Program repair is challenging for two reasons: First, it requires reasoning and tracking symbols across source code and diagnostic feedback. Second, labeled datasets available for program repair are relatively small. In this work, we propose novel solutions to these two challenges. First, we introduce a program-feedback graph, which connects symbols relevant to program repair in source code and diagnostic feedback, and then apply a graph neural network on top to model the reasoning process. Second, we present a self-supervised learning paradigm for program repair that leverages unlabeled programs available online to create a large amount of extra program repair examples, which we use to pre-train our models. We evaluate our proposed approach on two applications: correcting introductory programming assignments (DeepFix dataset) and correcting the outputs of program synthesis (SPoC dataset). Our final system, DrRepair, significantly outperforms prior work, achieving 66.1% full repair rate on DeepFix (+20.8% over the prior best), and 48.0% synthesis success rate on SPoC (+3.3% over the prior best). +	repair edit GNN
2020	Learning Autocompletion from Real-World Datasets + + + + +	Gareth Ari Aye, Seohyun Kim, Hongyu Li		Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in completion suggestion prediction on synthetic benchmarks. However, a recent study When Code Completion Fails: a Case Study on Real-World Completions demonstrates that these results may not translate to improvements in real-world performance. To combat this effect, we train models on real-world code completion examples and find that these models outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy respectively. We observe this improvement across modeling technologies and show through A/B testing that it corresponds to a 6.2% increase in programmers’ actual autocompletion usage. Furthermore, our study characterizes a large corpus of logged autocompletion usages to investigate why training on real-world examples leads to stronger models. +	autocomplete
2020	Predicting Vulnerability in Large Codebases With Deep Code Representation + + + + +	Anshul Tanwar, Krishna Sundaresan, Parmesh Ashwath, Prasanna Ganesan, Sathish Kumar Chandrasekaran, Sriram Ravi		Currently, while software engineers write code for various modules, quite often, various types of errors - coding, logic, semantic, and others (most of which are not caught by compilation and other tools) get introduced. Some of these bugs might be found in the later stage of testing, and many times it is reported by customers on production code. Companies have to spend many resources, both money and time in finding and fixing the bugs which would have been avoided if coding was done right. Also, concealed flaws in software can lead to security vulnerabilities that potentially allow attackers to compromise systems and applications. Interestingly, same or similar issues/bugs, which were fixed in the past (although in different modules), tend to get introduced in production code again. +We developed a novel AI-based system which uses the deep representation of Abstract Syntax Tree (AST) created from the source code and also the active feedback loop to identify and alert the potential bugs that could be caused at the time of development itself i.e. as the developer is writing new code (logic and/or function). This tool integrated with IDE as a plugin would work in the background, point out existing similar functions/code-segments and any associated bugs in those functions. The tool would enable the developer to incorporate suggestions right at the time of development, rather than waiting for UT/QA/customer to raise a defect. +We assessed our tool on both open-source code and also on Cisco codebase for C and C++ programing language. Our results confirm that deep representation of source code and the active feedback loop is an assuring approach for predicting security and other vulnerabilities present in the code. +	grammar program analysis static analysis
2020	Sequence Model Design for Code Completion in the Modern IDE + + + + +	Gareth Ari Aye, Gail E. Kaiser	Optional	Code completion plays a prominent role in modern integrated development environments (IDEs). Machine learning has become ubiquitous in analogous natural language writing and search software, surfacing more relevant autocompletions and search suggestions in fewer keystrokes. Prior research has reported training high-accuracy, deep neural networks for modeling source code, but little attention has been given to the practical constraints imposed by interactive developer tools. In particular, neural language models for source code modeling like the one described in Maybe Deep Neural Networks are the Best Choice for Modeling Source Code are framed around code completion, but only report accuracy of next-token prediction. However, in order for a language model (LM) to work well within real-world code completion systems, it must also always make suggestions that produce valid code that typechecks to support code completion’s role in correctness-checking; return instantaneous results to help programmers code more efficiently in fewer keystrokes; and be small enough to fit comfortably on disk and in memory on developer workstations, since virtually all modern IDEs run locally and support offline usage. To meet these additional requirements, we propose a novel design for predicting top-k next tokens that combines static analysis’ ability to enumerate all valid keywords and in-scope identifiers with the ability of a language model to place a probability distribution over them. Our model mixes character-level input representation with token output to represent out-of-vocabulary (OOV) tokens meaningfully and minimize prediction latency. OOV tokens can be predicted through detection of local repetition common in software. This design achieves state-of-art accuracy in source code modeling and fits the constraints imposed by real-world code completion implementations in modern IDEs. +	autocomplete
2020	Towards Learning Representations of Binary Executable Files for Security Tasks + + + + +	Shushan Arakelyan, Sima Arasteh, Christophe Hauser, Erik Kline, Aram Galstyan	AAAI	Tackling binary analysis problems has traditionally implied manually defining rules and heuristics. As an alternative, we are suggesting using machine learning models for learning distributed representations of binaries that can be applicable for a number of downstream tasks. We construct a computational graph from the binary executable and use it with a graph convolutional neural network to learn a high dimensional representation of the program. We show the versatility of this approach by using our representations to solve two semantically different binary analysis tasks – algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as published results and demonstrate improvement on the state of the art methods for both tasks. +	GNN representation
2020	LambdaNet: Probabilistic Type Inference using Graph Neural Networks + + + + +	Jiayi Wei, Maruth Goyal, Greg Durrett, Isil Dillig	ICLR	As gradual typing becomes increasingly popular in languages like Python and TypeScript, there is a growing need to infer type annotations automatically. While type annotations help with tasks like code completion and static error catching, these annotations cannot be fully inferred by compilers and are tedious to annotate by hand. This paper proposes a probabilistic type inference scheme for TypeScript based on a graph neural network. Our approach first uses lightweight source code analysis to generate a program abstraction called a type dependency graph, which links type variables with logical constraints as well as name and usage information. Given this program abstraction, we then use a graph neural network to propagate information between related type variables and eventually make type predictions. Our neural architecture can predict both standard types, like number or string, as well as user-defined types that have not been encountered during training. Our experimental results show that our approach outperforms prior work in this space by 14% (absolute) on library types, while having the ability to make type predictions that are out of scope for existing techniques. +	GNN types
2020	Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers + + + + +	Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, Neel Sundaresan	ICSE	Unit testing represents the foundational basis of the software testing pyramid, beneath integration and end-to-end testing. Automated software testing researchers have proposed a variety of techniques to assist developers in this time-consuming task. In this paper we present an approach to support developers in writing unit test cases by generating accurate and useful assert statements. Our approach is based on a state-of-the-art transformer model initially pretrained on an English textual corpus. This semantically rich model is then trained in a semi-supervised fashion on a large corpus of source code. Finally, we finetune this model on the task of generating assert statements for unit tests. The resulting model is able to generate accurate assert statements for a given method under test. In our empirical evaluation, the model was able to predict the exact assert statements written by developers in 62% of the cases in the first attempt. The results show 80% relative improvement for top-1 accuracy over the previous RNN-based approach in the literature. We also show the substantial impact of the pretraining process on the performances of our model, as well as comparing it with assert auto-completion task. Finally, we demonstrate how our approach can be used to augment EvoSuite test cases, with additional asserts leading to improved test coverage. +	code generation synthesis test generation
2020	Typilus: Neural Type Hints + + + + +	Miltiadis Allamanis, Earl T. Barr, Soline Ducousso, Zheng Gao	PLDI	Type inference over partial contexts in dynamically typed languages is challenging. In this work, we present a graph neural network model that predicts types by probabilistically reasoning over a program’s structure, names, and patterns. The network uses deep similarity learning to learn a TypeSpace – a continuous relaxation of the discrete space of types – and how to embed the type properties of a symbol (i.e. identifier) into it. Importantly, our model can employ one-shot learning to predict an open vocabulary of types, including rare and user-defined ones. We realise our approach in Typilus for Python that combines the TypeSpace with an optional type checker. We show that Typilus accurately predicts types. Typilus confidently predicts types for 70% of all annotatable symbols; when it predicts a type, that type optionally type checks 95% of the time. Typilus can also find incorrect type annotations; two important and popular open source libraries, fairseq and allennlp, accepted our pull requests that fixed the annotation errors Typilus discovered. +	types GNN
2020	A Transformer-based Approach for Source Code Summarization + + + + +	Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang	ACL	Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model that uses a self-attention mechanism and has shown to be effective in capturing long-range dependencies. In this work, we show that despite the approach is simple, it outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., the absolute encoding of source code tokens’ position hinders, while relative encoding significantly improves the summarization performance. We have made our code publicly available to facilitate future research. +	summarization
2020	Graph4Code: A Machine Interpretable Knowledge Graph for Code + + + + +	Ibrahim Abdelaziz, Julian Dolby, James P. McCusker, Kavitha Srinivas		Knowledge graphs have proven extremely useful in powering diverse applications in semantic search and natural language understanding. Graph4Code is a knowledge graph about program code that can similarly power diverse applications such as program search, code understanding, refactoring, bug detection, and code automation. The graph uses generic techniques to capture the semantics of Python code: the key nodes in the graph are classes, functions and methods in popular Python modules. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code), and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). We make extensive use of named graphs in RDF to make the knowledge graph extensible by the community. We describe a set of generic extraction techniques that we applied to over 1.3M Python files drawn from GitHub, over 2,300 Python modules, as well as 47M forum posts to generate a graph with over 2 billion triples. We also provide a number of initial use cases of the knowledge graph in code assistance, enforcing best practices, debugging and type inference. The graph and all its artifacts are available to the community for use. +	dataset
2020	Generating Adversarial Examples for Holding Robustness of Source Code Processing Models + + + + +	Huangzhao Zhang, Zhuo Li, Ge Li, Lei Ma, Yang Liu, Zhi Jin	AAAI	Automated processing, analysis, and generation of source code are among the key activities +in software and system life-cycle. To this end, while deep learning (DL) exhibits a certain level +of capability in handling these tasks, the current state-of-the-art DL models still suffer from +non-robust issues and can be easily fooled by adversarial attacks. + + Different from adversarial +attacks for image, audio, andnatural languages, the structured nature of programming +languages brings new challenges. In this paper, we propose a Metropolis-Hastings +sampling-based identifier renaming technique, named Metropolis-Hastings Modifier (MHM), +which generates adversarial examples for DL models specialized for source code processing. +Our in-depth evaluation on a functionality classification benchmark demonstrates the +effectiveness of MHM in generating adversarial examples of source code. The higher robustness +and performance enhanced through our adversarial training with MHM further confirms the usefulness +of DL models-based method for future fully automated source code processing. +	adversarial
2020	Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree + + + + +	Wenhan Wang, Ge Li, Bo Ma, Xin Xia, Zhi Jin	IEEE International Conference on Software Analysis, Evolution, and Reengineering	Code clones are semantically similar code fragments pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches of detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still can not fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. As far as we have concerned, we are the first to apply graph neural networks on the domain of code clone detection. We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks. +	clone GNN
2020	Modular Tree Network for Source Code Representation Learning + + + + +	Wenhan Wang, Ge Li, Sijie Shen, Xin Xia, Zhi Jin	TOSEM	Learning representation for source code is a foundation of many program analysis tasks. In recent years, neural networks have already shown success in this area, but most existing models did not make full use of the unique structural information of programs. Although abstract syntax tree (AST)-based neural models can handle the tree structure in the source code, they cannot capture the richness of different types of substructure in programs. In this article, we propose a modular tree network that dynamically composes different neural network units into tree structures based on the input AST. Different from previous tree-structural neural network models, a modular tree network can capture the semantic differences between types of AST substructures. We evaluate our model on two tasks: program classification and code clone detection. Our model achieves the best performance compared with state-of-the-art approaches in both tasks, showing the advantage of leveraging more elaborate structure information of the source code. +	grammar representation
2020	Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks + + + + +	Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, Yang Liu	NeurIPS	Vulnerability identification is crucial to protect the software systems from attacks for cyber security. It is especially important to localize the vulnerable functions among the source code to facilitate the fix. However, it is a challenging and tedious process, and also requires specialized security expertise. Inspired by the work on manually-defined patterns of vulnerabilities from various code representation graphs and the recent advance on graph neural networks, we propose Devign, a general graph neural network based model for graph-level classification through learning on a rich set of code semantic representations. It includes a novel Conv module to efficiently extract useful features in the learned rich node representations for graph-level classification. The model is trained over manually labeled datasets built on 4 diversified large-scale open-source C projects that incorporate high complexity and variety of real source code instead of synthesis code used in previous works. The results of the extensive evaluation on the datasets demonstrate that Devign outperforms the state of the arts significantly with an average of 10.51% higher accuracy and 8.68% F1 score, increases averagely 4.66% accuracy and 6.37% F1 by the Conv module. +	GNN static analysis
2020	funcGNN: A Graph Neural Network Approach to Program Similarity + + + + +	Aravind Nair, Avijit Roy, Karl Meinke	ESEM	Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. We introduce funcGNN, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between high-level language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (0.00194), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with the state of the art methods. funcGNN posses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions. +	GNN clone
2020	Searching a Database of Source Codes Using Contextualized Code Search + + + + +	Rohan Mukherjee, Swarat Chaudhuri, Chris Jermaine		We assume a database containing a large set of program source codes and consider the problem of contextualized code search over that database. A programmer has written some part of a program, but has left part of the program (such as a method or a function body) incomplete. The goal is to use the context surrounding the missing code to automatically ‘figure out’ which of the codes in the database would be useful to the programmer in order to help complete the missing code, in the sense that the programmer could either re-purpose the retrieved code and use the re-purposed code to fill the missing spot in the program. Or, the user could use the retrieved code as a model for implementing the missing code. The search is ‘contextualized’ in the sense that the search engine should use clues in the partially-completed code to figure out which database code is most useful. The user should not be required to formulate an explicit query. + + We cast contextualized code search as a learning problem, where the goal is to learn a distribution function computing the likelihood that each database code completes the program, and propose a neural model for predicting which database code is likely to be most useful. Because it will be prohibitively expensive to apply a neural model to each code in a database of millions or billions of codes at search time, one of our key technical concerns is ensuring a speedy search. We address this by learning a ‘reverse encoder’ that can be used to reduce the problem of evaluating each database code to computing a convolution of two normal distributions, making it possible to search a large database of codes in a reasonable time. +	search representation
2020	A Survey on Deep Learning for Software Engineering + + + + +	Yanming Yang, Xin Xia, David Lo, John Grundy		In 2006, Geoffrey Hinton proposed the concept of training ‘‘Deep Neural Networks (DNNs)’’ and an improved model training method to break the bottleneck of neural network development. More recently, the introduction of AlphaGo in 2016 demonstrated the powerful learning ability of deep learning and its enormous potential. Deep learning has been increasingly used to develop state-of-the-art software engineering (SE) research tools due to its ability to boost performance for various SE tasks. There are many factors, e.g., deep learning model selection, internal structure differences, and model optimization techniques, that may have an impact on the performance of DNNs applied in SE. Few works to date focus on summarizing, classifying, and analyzing the application of deep learning techniques in SE. To fill this gap, we performed a survey to analyse the relevant studies published since 2006. We first provide an example to illustrate how deep learning techniques are used in SE. We then summarize and classify different deep learning techniques used in SE. We analyzed key optimization technologies used in these deep learning models, and finally describe a range of key research topics using DNNs in SE. Based on our findings, we present a set of current challenges remaining to be investigated and outline a proposed research road map highlighting key opportunities for future work. +	survey
2020	Learning to Represent Programs with Heterogeneous Graphs + + + + +	Wenhan Wang, Kechi Zhang, Ge Li, Zhi Jin		Program source code contains complex structure information, which can be represented in structured data forms like trees or graphs. To acquire the structural information in source code, most existing researches use abstract syntax trees (AST). A group of works add additional edges to ASTs to convert source code into graphs and use graph neural networks to learn representations for program graphs. Although these works provide additional control or data flow information to ASTs for downstream tasks, they neglect an important aspect of structure information in AST itself: the different types of nodes and edges. In ASTs, different nodes contain different kinds of information like variables or control flow, and the relation between a node and all its children can also be different. + + To address the information of node and edge types, we bring the idea of heterogeneous graphs to learning on source code and present a new formula of building heterogeneous program graphs from ASTs with additional type information for nodes and edges. We use the ASDL grammar of programming language to define the node and edge types of program graphs. Then we use heterogeneous graph neural networks to learn on these graphs. We evaluate our approach on two tasks: code comment generation and method naming. Both tasks require reasoning on the semantics of complete code snippets. Experiment results show that our approach outperforms baseline models, including homogeneous graph-based models, showing that leveraging the type information of nodes and edges in program graphs can help in learning program semantics. +	GNN summarization
2020	Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks + + + + +	Nikita Mehrotra, Navdha Agarwal, Piyush Gupta, Saket Anand, David Lo, Rahul Purandare		Code clones are duplicate code fragments that share (nearly) similar syntax or semantics. Code clone detection plays an important role in software maintenance, code refactoring, and reuse. A substantial amount of research has been conducted in the past to detect clones. A majority of these approaches use lexical and syntactic information to detect clones. However, only a few of them target semantic clones. Recently, motivated by the success of deep learning models in other fields, including natural language processing and computer vision, researchers have attempted to adopt deep learning techniques to detect code clones. These approaches use lexical information (tokens) and(or) syntactic structures like abstract syntax trees (ASTs) to detect code clones. However, they do not make sufficient use of the available structural and semantic information hence, limiting their capabilities. + + This paper addresses the problem of semantic code clone detection using program dependency graphs and geometric neural networks, leveraging the structured syntactic and semantic information. We have developed a prototype tool HOLMES, based on our novel approach, and empirically evaluated it on popular code clone benchmarks. Our results show that HOLMES performs considerably better than the other state-of-the-art tool, TBCCD. We also evaluated HOLMES on unseen projects and performed cross dataset experiments to assess the generalizability of HOLMES. Our results affirm that HOLMES outperforms TBCCD since most of the pairs that HOLMES detected were either undetected or suboptimally reported by TBCCD. +	clone GNN
2020	Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries + + + + +	Shuhan Yan, Hang Yu, Yuting Chen, Beijun Shen, Lingxiao Jiang	SANER	Code search methods, especially those that allow programmers to raise queries in a natural language, plays an important role in software development. It helps to improve programmers’ productivity by returning sample code snippets from the Internet and/or source-code repositories for their natural-language queries. Meanwhile, there are many code search methods in the literature that support natural-language queries. Difficulties exist in recognizing the strengths and weaknesses of each method and choosing the right one for different usage scenarios, because (1) the implementations of those methods and the datasets for evaluating them are usually not publicly available, and (2) some methods leverage different training datasets or auxiliary data sources and thus their effectiveness cannot be fairly measured and may be negatively affected in practical uses. To build a common ground for measuring code search methods, this paper builds CosBench, a dataset that consists of 1000 projects, 52 code-independent natural-language queries with ground truths, and a set of scripts for calculating four metrics on code research results. We have evaluated four IR (Information Retrieval)-based and two DL (Deep Learning)-based code search methods on CosBench. The empirical evaluation results clearly show the usefulness of the CosBench dataset and various strengths of each code search method. We found that DL-based methods are more suitable for queries on reusing code, and IR-based ones for queries on resolving bugs and learning API uses. +	search
2020	Suggesting Natural Method Names to Check Name Consistencies + + + + +	Son Nguyen, Hung Phan, Trinh Le, Tien N. Nguyen	ICSE	Misleading names of the methods in a project or the APIs in a software library confuse developers about program functionality +and API usages, leading to API misuses and defects. In this paper,we introduce MNire, a machine learning approach to check the +consistency between the name of a given method and its implementation. MNire first generates a candidate name and compares the +current name against it. If the two names are sufficiently similar, we consider the method as consistent. To generate the method name, +we draw our ideas and intuition from an empirical study on the nature of method names in a large dataset. Our key finding is that +high proportions of the tokens of method names can be found in the three contexts of a given method including its body, +the interface (the method’s parameter types and return type), and the enclosing class’ name. Even when such tokens are not there, +MNire uses the contexts to predict the tokens due to the high likelihoods of their co-occurrences. Our unique idea is to treat +the name generation as an abstract summarization on the tokens collected from the names of the program entities in the three +above contexts. + + We conducted several experiments to evaluate MNire in method name consistency checking and in method name +recommending on large datasets with +14M methods. In detecting inconsistency method names, MNire improves the state-of-the-art +approach by 10.4% and 11% relatively in recall and precision, respectively. In method name recommendation, MNire improves relatively +over the state-of-the-art technique, code2vec, in both recall (18.2% higher) and precision (11.1% higher). To assess MNire’s usefulness, +we used it to detect inconsistent methods and suggest new names in several active, GitHub projects. We made 50 pull requests (PRs) and received +42 responses. Among them, five PRs were merged into the main branch, and 13 were approved for later merging. In total, in 31/42 cases, +the developer teams agree that our suggested names are more meaningful than the current names, showing MNire’s usefulness. +	naming
2020	Learning Semantic Program Embeddings with Graph Interval Neural Network + + + + +	Yu Wang, Fengjuan Gao, Linzhang Wang, Ke Wang		Learning distributed representations of source code has been a challenging task for machine learning models. Earlier works treated programs as text so that natural language methods can be readily applied. Unfortunately, such approaches do not capitalize on the rich structural information possessed by source code. Of late, Graph Neural Network (GNN) was proposed to learn embeddings of programs from their graph representations. Due to the homogeneous and expensive message-passing procedure, GNN can suffer from precision issues, especially when dealing with programs rendered into large graphs. In this paper, we present a new graph neural architecture, called Graph Interval Neural Network (GINN), to tackle the weaknesses of the existing GNN. Unlike the standard GNN, GINN generalizes from a curated graph representation obtained through an abstraction method designed to aid models to learn. In particular, GINN focuses exclusively on intervals for mining the feature representation of a program, furthermore, GINN operates on a hierarchy of intervals for scaling the learning to large graphs. We evaluate GINN for two popular downstream applications: variable misuse prediction and method name prediction. Results show in both cases GINN outperforms the state-of-the-art models by a comfortable margin. We have also created a neural bug detector based on GINN to catch null pointer deference bugs in Java code. While learning from the same 9,000 methods extracted from 64 projects, GINN-based bug detector significantly outperforms GNN-based bug detector on 13 unseen test projects. Next, we deploy our trained GINN-based bug detector and Facebook Infer to scan the codebase of 20 highly starred projects on GitHub. Through our manual inspection, we confirm 38 bugs out of 102 warnings raised by GINN-based bug detector compared to 34 bugs out of 129 warnings for Facebook Infer. +	GNN defect
2019	A Neural Model for Method Name Generation from Functional Description + + + + +	Sa Gao, Chunyang Chen, Zhenchang Xing, Yukun Ma, Wen Song, Shang-Wei Lin	SANER	The names of software artifacts, e.g., method names, are important for software understanding and maintenance, as good names can help developers easily understand others’ code. However, the existing naming guidelines are difficult for developers, especially novices, to come up with meaningful, concise and compact names for the variables, methods, classes and files. With the popularity of open source, an enormous amount of project source code can be accessed, and the exhaustiveness and instability of manually naming methods could now be relieved by automatically learning a naming model from a large code repository. Nevertheless, building a comprehensive naming system is still challenging, due to the gap between natural language functional descriptions and method names. Specifically, there are three challenges: how to model the relationship between the functional descriptions and formal method names, how to handle the explosion of vocabulary when dealing with large repositories, and how to leverage the knowledge learned from large repositories to a specific project. To answer these questions, we propose a neural network to directly generate readable method names from natural language description. The proposed method is built upon the encoder-decoder framework with the attention and copying mechanisms. Our experiments show that our method can generate meaningful and accurate method names and achieve significant improvement over the state-of-the-art baseline models. We also address the cold-start problem using a training trick to utilize big data in GitHub for specific projects. +	naming summarization
2019	Coda: An End-to-End Neural Program Decompiler + + + + +	Cheng Fu, Huili Chen, Haolan Liu, Xinyun Chen, Yuandong Tian, Farinaz Koushanfar, Jishen Zhao	NeurIPS	Reverse engineering of binary executables is a critical problem in the computer security domain. On the one hand, malicious parties may recover interpretable source codes from the software products to gain commercial advantages. On the other hand, binary decompilation can be leveraged for code vulnerability analysis and malware detection. However, efficient binary decompilation is challenging. Conventional decompilers have the following major limitations: (i) they are only applicable to specific source-target language pair, hence incurs undesired development cost for new language tasks; (ii) their output high-level code cannot effectively preserve the correct functionality of the input binary; (iii) their output program does not capture the semantics of the input and the reversed program is hard to interpret. To address the above problems, we propose Coda1, the first end-to-end neural-based framework for code decompilation. Coda decomposes the decompilation task into of two key phases: First, Coda employs an instruction type-aware encoder and a tree decoder for generating an abstract syntax tree (AST) with attention feeding during the code sketch generation stage. Second, Coda then updates the code sketch using an iterative error correction machine guided by an ensembled neural error predictor. By finding a good approximate candidate and then fixing it towards perfect, Coda achieves superior with performance compared to baseline approaches. We assess Coda’s performance with extensive experiments on various benchmarks. Evaluation results show that Coda achieves an average of 82% program recovery accuracy on unseen binary samples, where the state-of-the-art decompilers yield 0% accuracy. Furthermore, Coda outperforms the sequence-to-sequence model with attention by a margin of 70% program accuracy. Our work reveals the vulnerability of binary executables and imposes a new threat to the protection of Intellectual Property (IP) for software development. +	decompilation
2019	A case study on machine learning for synthesizing benchmarks + + + + +	Andrés Goens, Alexander Brauckmann, Sebastian Ertel, Chris Cummins, Hugh Leather, Jeronimo Castrillon	MAPL	Good benchmarks are hard to find because they require a substantial effort to keep them representative for the constantly changing challenges of a particular field. Synthetic benchmarks are a common approach to deal with this, and methods from machine learning are natural candidates for synthetic benchmark generation. In this paper we investigate the usefulness of machine learning in the prominent CLgen benchmark generator. We re-evaluate CLgen by comparing the benchmarks generated by the model with the raw data used to train it. This re-evaluation indicates that, for the use case considered, machine learning did not yield additional benefit over a simpler method using the raw data. We investigate the reasons for this and provide further insights into the challenges the problem could pose for potential future generators. +	code generation
2019	Structured Neural Summarization + + + + +	Patrick Fernandes, Miltiadis Allamanis, Marc Brockschmidt	ICLR	Summarization of long sequences into a concise statement is a core problem in natural language processing, requiring non-trivial understanding of the input. Based on the promising results of graph neural networks on highly structured data, we develop a framework to extend existing sequence encoders with a graph component that can reason about long-distance relationships in weakly structured data such as text. In an extensive evaluation, we show that the resulting hybrid sequence-graph models outperform both pure sequence models as well as pure graph models on a range of summarization tasks. +	summarization GNN documentation
2019	A Novel Neural Source Code Representation based on Abstract Syntax Tree + + + + +	Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, Xudong Liu	ICSE	Exploiting machine learning techniques for analyzing programs has attracted much attention. One key problem is how to represent code fragments well for follow-up analysis. Traditional information retrieval based methods often treat programs as natural language texts, which could miss important semantic information of source code. Recently, state-of-the-art studies demonstrate that abstract syntax tree (AST) based neural models can better represent source code. However, the sizes of ASTs are usually large and the existing models are prone to the long-term dependency problem. In this paper, we propose a novel AST-based Neural Network (ASTNN) for source code representation. Unlike existing models that work on entire ASTs, ASTNN splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements. Based on the sequence of statement vectors, a bidirectional RNN model is used to leverage the naturalness of statements and finally produce the vector representation of a code fragment. We have applied our neural network based source code representation method to two common program comprehension tasks: source code classification and code clone detection. Experimental results on the two tasks indicate that our model is superior to state-of-the-art approaches. +	representation grammar
2019	Learning to Sport and Refactor Inconsistent Method Names + + + + +	Kui Liu, Dongsun Kim, Tegawendé F. Bissyandé, Taeyoung Kim, Kisub Kim, Anil Koyuncu, Suntae Kim, Yves Le Traon	ICSE	To ensure code readability and facilitate software maintenance, program methods must be named properly. In particular, method names must be consistent with the corresponding method implementations. Debugging method names remains an important topic in the literature, where various approaches analyze commonalities among method names in a large dataset to detect inconsistent method names and suggest better ones. We note that the state-of-the-art does not analyze the implemented code itself to assess consistency. We thus propose a novel automated approach to debugging method names based on the analysis of consistency between method names and method code. The approach leverages deep feature representation techniques adapted to the nature of each artifact. Experimental results on over 2.1 million Java methods show that we can achieve up to 15 percentage points improvement over the state-of-the-art, establishing a record performance of 67.9% F1-measure in identifying inconsistent method names. We further demonstrate that our approach yields up to 25% accuracy in suggesting full names, while the state-of-the-art lags far behind at 1.1% accuracy. Finally, we report on our success in fixing 66 inconsistent method names in a live study on projects in the wild. +	naming
2019	Neural-Network Guided Expression Transformation + + + + +	Romain Edelmann, Viktor Kunčak		Optimizing compilers, as well as other translator systems, often work by rewriting expressions according to equivalence preserving rules. Given an input expression and its optimized form, finding the sequence of rules that were applied is a non-trivial task. Most of the time, the tools provide no proof, of any kind, of the equivalence between the original expression and its optimized form. In this work, we propose to reconstruct proofs of equivalence of simple mathematical expressions, after the fact, by finding paths of equivalence preserving transformations between expressions. We propose to find those sequences of transformations using a search algorithm, guided by a neural network heuristic. Using a Tree-LSTM recursive neural network, we learn a distributed representation of expressions where the Manhattan distance between vectors approximately corresponds to the rewrite distance between expressions. We then show how the neural network can be efficiently used to search for transformation paths, leading to substantial gain in speed compared to an uninformed exhaustive search. In one of our experiments, our neural-network guided search algorithm is able to solve more instances with a 2 seconds timeout per instance than breadth-first search does with a 5 minutes timeout per instance. +	optimization grammar
2019	Unsupervised Learning of API Aliasing Specifications + + + + +	Jan Eberhardt, Samuel Steffen, Veselin Raychev, Martin Vechev	PLDI	Real world applications make heavy use of powerful libraries +and frameworks, posing a significant challenge for static analysis +as the library implementation may be very complex or unavailable. +Thus, obtaining specifications that summarize the behaviors of +the library is important as it enables static analyzers to precisely +track the effects of APIs on the client program, without requiring +the actual API implementation. + + In this work, we propose a novel method +for discovering aliasing specifications of APIs by learning from a large +dataset of programs. Unlike prior work, our method does not require +manual annotation, access to the library’s source code or ability to +run its APIs. Instead, it learns specifications in a fully unsupervised manner, +by statically observing usages of APIs in the dataset. The core idea is to +learn a probabilistic model of interactions between API methods and aliasing +objects, enabling identification of additional likely aliasing relations, +and to then infer aliasing specifications ofAPIs that explain these relations. +The learned specifications are then used to augment an API-aware points-to analysis. + + We implemented our approach in a tool called USpec and used it to automatically +learn aliasing specifications from millions of source code files. +USpec learned over 2000 specifications of various Java and Python APIs, in the process +improving the results of the points-to analysis and its clients. +	API program analysis
2019	Semantic Source Code Models Using Identifier Embeddings + + + + +	Vasiliki Efstathiou, Diomidis Spinellis	MSR	The emergence of online open source repositories in the recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As an effect, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code which can assist tasks of code search and reuse. To this end, we deliver in the form of pretrained vector space models, distributed code representations for six popular programming languages, namely, Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state-of-the-art library for learning word representations. Each model is trained on data from a single programming language; the code mined for producing all models amounts to over 13.000 repositories. We indicate dissimilarities between natural language and source code, as well as variations in coding conventions in between the different programming languages we processed. We describe how these heterogeneities guided the data preprocessing decisions we took and the selection of the training parameters in the released models. Finally, we propose potential applications of the models and discuss limitations of the models. +	representation
2019	Simulating Execution Time of Tensor Programs using Graph Neural Networks + + + + +	Jakub M. Tomczak, Romain Lepert, Auke Wiggers	Representation Learning on Graphs and Manifolds at ICLR	Optimizing the execution time of tensor program, e.g., a convolution, involves finding its optimal configuration. Searching the configuration space exhaustively is typically infeasible in practice. In line with recent research using TVM, we propose to learn a surrogate model to overcome this issue. The model is trained on an acyclic graph called an abstract syntax tree, and utilizes a graph convolutional network to exploit structure in the graph. We claim that a learnable graph-based data processing is a strong competitor to heuristic-based feature extraction. We present a new dataset of graphs corresponding to configurations and their execution time for various tensor programs. We provide baselines for a runtime prediction task. +	GNN
2019	Recovering Variable Names for Minified Code with Usage Contexts + + + + +	Hieu Tran, Ngoc Tran, Son Nguyen, Hoan Nguyen, Tien N. Nguyen	ICSE	In modern Web technology, JavaScript (JS) code plays an important role. To avoid the exposure of original source code, the variable names in JS code deployed in the wild are often replaced by short, meaningless names, thus making the code extremely difficult to manually understand and analysis. This paper presents JSNeat, an information retrieval (IR)-based approach to recover the variable names in minified JS code. JSNeat follows a data-driven approach to recover names by searching for them in a large corpus of open-source JS code. We use three types of contexts to match a variable in given minified code against the corpus including the context of properties and roles of the variable, the context of that variable and relations with other variables under recovery, and the context of the task of the function to which the variable contributes. We performed several empirical experiments to evaluate JSNeat on the dataset of more than 322K JS files with 1M functions, and 3.5M variables with 176K unique variable names. We found that JSNeat achieves a high accuracy of 69.1%, which is the relative improvements of 66.1% and 43% over two state-of-the-art approaches JSNice and JSNaughty, respectively. The time to recover for a file or for a variable with JSNeat is twice as fast as with JSNice and 4x as fast as with JNaughty, respectively. +	naming deobfuscation
2019	Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization + + + + +	Steven H. H. Ding, Benjamin C. M. Fung, Philippe Charland	IEEE Symposium on Security and Privacy	Reverse engineering is a manually intensive but necessary technique for understanding the inner workings of new malware, finding vulnerabilities in existing systems, and detecting patent infringements in released software. An assembly clone search engine facilitates the work of reverse engineers by identifying those duplicated or known parts. However, it is challenging to design a robust clone search engine, since there exist various compiler optimization options and code obfuscation techniques that make logically similar assembly functions appear to be very different. A practical clone search engine relies on a robust vector representation of assembly code. However, the existing clone search approaches, which rely on a manual feature engineering process to form a feature vector for an assembly function, fail to consider the relationships between features and identify those unique patterns that can statistically distinguish assembly functions. To address this problem, we propose to jointly learn the lexical semantic relationships and the vector representation of assembly functions based on assembly code. We have developed an assembly code representation learning model \emph{Asm2Vec}. It only needs assembly code as input and does not require any prior knowledge such as the correct mapping between assembly functions. It can find and incorporate rich semantic relationships among tokens appearing in assembly code. We conduct extensive experiments and benchmark the learning model with state-of-the-art static and dynamic clone search approaches. We show that the learned representation is more robust and significantly outperforms existing methods against changes introduced by obfuscation and optimizations. +	representation clone
2019	NL2Type: Inferring JavaScript Function Types from Natural Language Information + + + + +	Rabee Sohail Malik, Jibesh Patra, Michael Pradel	ICSE	JavaScript is dynamically typed and hence lacks thetype safety of statically typed languages, +leading to suboptimal IDE support, difficult to understand APIs, and unexpected run-time behavior. +Several gradual type systems have been proposed, e.g., Flow and TypeScript, but they rely on developers +to annotatecode with types. This paper presents NL2Type, a learning-based approach for predicting likely +type signatures of JavaScript functions. The key idea is to exploit natural language information in +source code, such as comments, function names, and parameternames, a rich source of knowledge +that is typically ignored by type inference algorithms. We formulate the problem of predicting +types as a classification problem and train a recurrent, LSTM-based neural model that, after learning +from an annotatedcode base, predicts function types for unannotated code. We evaluate the +approach with a corpus of 162,673 JavaScript files from real-world projects. +NL2Type predicts types with aprecision of 84.1% and a recall of 78.9% when considering only +the top-most suggestion, and with a precision of 95.5% and arecall of 89.6% when +considering the top-5 suggestions. The +approach outperforms both JSNice, a state-of-the-art approach that analyzes implementations +of functions instead of natural language information, and DeepTyper, a recent type prediction +approach that is also based on deep learning. Beyond predicting types, NL2Type serves as a +consistency checker for existing type annotations. We show that it discovers 39 inconsistencies +that deserve developer attention (from a manual analysis of 50 warnings), most of which +are due to incorrect type annotations. +	bimodal types
2019	Neural Attribution for Semantic Bug-Localization in Student Programs + + + + +	Rahul Gupta, Aditya Kanade, Shirish Shevade	NeurIPS	Providing feedback is an integral part of teaching. Most open online courses on programming make use of automated grading systems to support programming assignments and give real-time feedback. These systems usually rely on test results to quantify the programs’ functional correctness. They return failing tests to the students as feedback. However, students may find it difficult to debug their programs if they receive no hints about where the bug is and how to fix it. In this work, we present NeuralBugLocator, a deep learning based technique, that can localize the bugs in a faulty program with respect to a failing test, without even running the program. At the heart of our technique is a novel tree convolutional neural network which is trained to predict whether a program passes or fails a given test. To localize the bugs, we analyze the trained network using a state-of-the-art neural prediction attribution technique and see which lines of the programs make it predict the test outcomes. Our experiments show that NeuralBugLocator is generally more accurate than two state-of-the-art program-spectrum based and one syntactic difference based bug-localization baselines. +	defect representation
2019	Learning Uniform Semantic Features for Natural Language and Programming Language Globally, Locally and Sequentially + + + + +	Yudong Zhang, Wenhao Zheng, Ming Li	AAAI	Semantic feature learning for natural language and programming language is a preliminary step in addressing many software mining tasks. Many existing methods leverage +information in lexicon and syntax to learn features for textual data. +However, such information is inadequate to represent the entire semantics in either text sentence or code snippet. This +motivates us to propose a new approach to learn semantic +features for both languages, through extracting three levels of +information, namely global, local and sequential information, +from textual data. For tasks involving both modalities, we +project the data of both types into a uniform feature space so +that the complementary knowledge in between can be utilized +in their representation. In this paper, we build a novel and +general-purpose feature learning framework called UniEmbed, to uniformly learn comprehensive semantic representation for both natural language and programming language. +Experimental results on three real-world software mining +tasks show that UniEmbed outperforms state-of-the-art models in feature learning and prove the capacity and effectiveness of our model. +	representation bimodal
2019	SampleFix: Learning to Correct Programs by Sampling Diverse Fixes + + + + +	Hossein Hajipour, Apratim Bhattacharyya, Cristian-Alexandru Staicu, Mario Fritz		Automatic program correction is an active topic of research, which holds the potential of dramatically improving productivity of programmers during the software development process and correctness of software in general. Recent advances in machine learning, deep learning and NLP have rekindled the hope to eventually fully automate the process of repairing programs. A key challenges is ambiguity, as multiple codes – or fixes – can implement the same functionality. In addition, dataset by nature fail to capture the variance introduced by such ambiguities. Therefore, we propose a deep generative model to automatically correct programming errors by learning a distribution of potential fixes. Our model is formulated as a deep conditional variational autoencoder that samples diverse fixes for the given erroneous programs. In order to account for ambiguity and inherent lack of representative datasets, we propose a novel regularizer to encourage the model to generate diverse fixes. Our evaluations on common programming errors show for the first time the generation of diverse fixes and strong improvements over the state-of-the-art approaches by fixing up to 61% of the mistakes. +	repair code generation
2019	Neural Bug Finding: A Study of Opportunities and Challenges + + + + +	Andrew Habib, Michael Pradel		Static analysis is one of the most widely adopted techniques to find software bugs before code is put in production. Designing and implementing effective and efficient static analyses is difficult and requires high expertise, which results in only a few experts able to write such analyses. This paper explores the opportunities and challenges of an alternative way of creating static bug detectors: neural bug finding. The basic idea is to formulate bug detection as a classification problem, and to address this problem with neural networks trained on examples of buggy and non-buggy code. We systematically study the effectiveness of this approach based on code examples labeled by a state-of-the-art, static bug detector. Our results show that neural bug finding is surprisingly effective for some bug patterns, sometimes reaching a precision and recall of over 80%, but also that it struggles to understand some program properties obvious to a traditional analysis. A qualitative analysis of the results provides insights into why neural bug finders sometimes work and sometimes do not work. We also identify pitfalls in selecting the code examples used to train and validate neural bug finders, and propose an algorithm for selecting effective training data. +	program analysis
2019	Import2vec - Learning Embeddings for Software Libraries + + + + +	Bart Theeten, Frederik Vandeputte, Tom Van Cutsem	MSR	We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning tasks (e.g. classification) or applications such as contextual search and analogical reasoning. + + We apply word embedding techniques from natural language processing (NLP) to train embeddings for library packages (“library vectors”). Library vectors represent libraries by similar context of use as determined by import statements present in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveals that library vectors capture semantically meaningful relationships among software libraries, such as the relationship between frameworks and their plug-ins and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript) and data science toolkits (in Python). +	representation
2019	Learning to Fix Build Errors with Graph2Diff Neural Networks + + + + +	Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton, Edward Aftandilian		Professional software developers spend a significant amount oftime fixing builds, but this has received little attention as a prob-lem in automatic program repair. We present a new deep learningarchitecture, called Graph2Diff, for automatically localizing andfixing build errors. We represent source code, build configurationfiles, and compiler diagnostic messages as a graph, and then use aGraph Neural Network model to predict a diff. A diff specifies howto modify the code’s abstract syntax tree, represented in the neuralnetwork as a sequence of tokens and of pointers to code locations.Our network is an instance of a more general abstraction which wecall Graph2Tocopo, which is potentially useful in any developmenttool for predicting source code changes. We evaluate the model ona dataset of over 500k real build errors and their resolutions fromprofessional developers. Compared to the approach of DeepDelta, our approach tackles the harder task of predicting a moreprecise diff but still achieves over double the accuracy. +	edit repair
2019	DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing + + + + +	Xiao Liu, Xiaoting Li, Rupesh Prajapati, Dinghao Wu	AAAI	Compilers are among the most fundamental programming +tools for building software. However, production compilers +remain buggy. Fuzz testing is often leveraged with newly-generated, +or mutated inputs in order to find new bugs or security vulnerabilities. +In this paper, we propose a grammar-based fuzzing tool called DeepFuzz. Based on a generative +Sequence-to-Sequence model, DeepFuzz automatically and continuously generates well-formed +C programs. We use this set of new C programs to fuzz off-the-shelf C compilers, e.g. GCC and Clang/LLVM. +We present a detailed case study to analyze the success rate and coverage improvement of the +generated C programs for fuzz testing. We analyze the performance of DeepFuzz with three types of sampling +methods as well as three types of generation strategies. Consequently, DeepFuzz +improved the testing efficacy in regards to the line, function, and branch coverage. In our preliminary +study, we found and reported 8 bugs of GCC, all of which are actively being addressed by developers. +	fuzzing code generation
2019	Learning Execution through Neural Code Fusion + + + + +	Zhan Shi, Kevin Swersky, Daniel Tarlow, Parthasarathy Ranganathan, Milad Hashemi		As the performance of computer systems stagnates due to the end of Moore’s Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn representations of source code, these representations do not understand how code dynamically executes. In this work, we propose a new approach to use GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and complex data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related task (algorithm classification). +	representation
2019	Generating commit messages from diffs using pointer-generator network + + + + +	Qin Liu, Zihe Liu, Hongming Zhu, Hongfei Fan, Bowen Du, Yu Qian.	MSR	The commit messages in source code repositories are valuable but not easy to be generated manually in time for tracking issues, reporting bugs, and understanding codes. Recently published works indicated that the deep neural machine translation approaches have drawn considerable attentions on automatic generation of commit messages. However, they could not deal with out-of-vocabulary (OOV) words, which are essential context-specific identifiers such as class names and method names in code diffs. In this paper, we propose PtrGNCMsg, a novel approach which is based on an improved sequence-to-sequence model with the pointer-generator network to translate code diffs into commit messages. By searching the smallest identifier set with the highest probability, PtrGNCMsg outperforms recent approaches based on neural machine translation, and first enables the prediction of OOV words. The experimental results based on the corpus of diffs and manual commit messages from the top 2,000 Java projects in GitHub show that PtrGNCMsg outperforms the state-of-the-art approach with improved BLEU by 1.02, ROUGE-1 by 4.00 and ROUGE-L by 3.78, respectively. +	edit
2019	TypeWriter: Neural Type Prediction with Search-based Validation + + + + +	Michael Pradel, Georgios Gousios, Jason Liu, Satish Chandra.		Maintaining large code bases written in dynamically typed languages, such as JavaScript or Python, can be challenging: simple data compatibility errors proliferate, IDE support is lacking and APIs are harder to comprehend. Recent work attempts to address those issues through either static analysis or probabilistic type inference. Unfortunately, static type inference for dynamic languages is inherently limited, while probabilistic approaches suffer from imprecision. This paper presents TypeWriter, the first combination of probabilistic prediction with search-based refinement of predicted types. TypeWriter’s predictor learns to infer the return and argument types for functions from partially annotated code bases by combining the natural language properties of code with programming language-level information. To validate predicted types, TypeWriter invokes a gradual type checker with different combinations of the predicted types, while navigating the space of possible type combinations in a feedback-directed manner. We implement the TypeWriter approach for Python and evaluate it on two code corpora: a multi-million line code base at Facebook and a collection of 500 popular open-source projects. We show that TypeWriter’s type predictor achieves a precision of 64% (91%) and a recall of 52% (68%) in the top-1 (top-5) predictions, and demonstrate that usage contexts are a helpful addition to neural type predictors. By combining predictions with search-based validation, TypeWriter can fully annotate between 42% to 64% of the files in a randomly selected corpus, while ensuring type correctness. A comparison with a static type inference tool shows that TypeWriter adds many more non-trivial types. Overall, TypeWriter provides developers with an effective way to help with the transition to fully type-annotated code. +	types bimodal
2019	Neural Reverse Engineering of Stripped Binaries + + + + +	Yaniv David, Uri Alon, Eran Yahav	ICLR	We address the problem of predicting procedure names in stripped executables which contain no debug information. +Predicting procedure names can dramatically ease the task of reverse engineering, saving precious time and human effort. +We present a novel approach that leverages static analysis of binaries with encoder-decoder-based neural networks. +The main idea is to use static analysis to obtain enriched representations of API call sites; encode a set of sequences +of these call sites; and finally, attend to the encoded sequences while decoding the target name token-by-token. +We evaluate our model by predicting procedure names over 60,000 procedures in 10,000 stripped executables. +Our model achieves 81.70 precision and 80.12 recall in predicting procedure names within GNU packages, and 55.48 +precision and 51.31 recall in a diverse, cross-package, dataset. Comparing to previous approaches, +the predictions made by our model are much more accurate and informative. +	naming deobfuscation GNN
2019	Adversarial Examples for Models of Code + + + + +	Noam Yefet, Uri Alon, Eran Yahav		Neural models of code have shown impressive performance for tasks such as predicting method names and identifying certain kinds of bugs. In this paper, we show that these models are vulnerable to adversarial examples, and introduce a novel approach for attacking trained models of code with adversarial examples. The main idea is to force a given trained model to make an incorrect prediction as specified by the adversary by introducing small perturbations that do not change the program’s semantics. To find such perturbations, we present a new technique for Discrete Adversarial Manipulation of Programs (DAMP). DAMP works by deriving the desired prediction with respect to the model’s inputs while holding the model weights constant and following the gradients to slightly modify the code. + + To defend a model against such attacks, we propose placing a defensive model (Anti-DAMP) in front of it. Anti-DAMP detects unlikely mutations and masks them before feeding the input to the downstream model. + + We show that our DAMP attack is effective across three neural architectures: code2vec, GGNN, and GNN-FiLM, in both Java and C#. We show that DAMP has up to 89% success rate in changing a prediction to the adversary’s choice (“targeted attack”), and a success rate of up to 94% in changing a given prediction to any incorrect prediction (“non-targeted attack”). By using Anti-DAMP, the success rate of the attack drops drastically for both targeted and non-targeted attacks, with a minor penalty of 2% relative degradation in accuracy while not performing under attack. +	adversarial
2019	On the Feasibility of Transfer-learning Code Smells using Deep Learning + + + + +	Tushar Sharma, Vasiliki Efstathiou, Panos Louridas, Diomidis Spinellis		Context: A substantial amount of work has been done to detect smells in source code using metrics-based and heuristics-based methods. Machine learning methods have been recently applied to detect source code smells; however, the current practices are considered far from mature. + + Objective: First, explore the feasibility of applying deep learning models to detect smells without extensive feature engineering, just by feeding the source code in tokenized form. Second, investigate the possibility of applying transfer-learning in the context of deep learning models for smell detection. + + Method: We use existing metric-based state-of-the-art methods for detecting three implementation smells and one design smell in C# code. Using these results as the annotated gold standard, we train smell detection models on three different deep learning architectures. These architectures use Convolution Neural Networks (CNNs) of one or two dimensions, or Recurrent Neural Networks (RNNs) as their principal hidden layers. For the first objective of our study, we perform training and evaluation on C# samples, whereas for the second objective, we train the models from C# code and evaluate the models over Java code samples. We perform the experiments with various combinations of hyper-parameters for each model. + + Results: We find it feasible to detect smells using deep learning methods. Our comparative experiments find that there is no clearly superior method between CNN-1D and CNN-2D. We also observe that performance of the deep learning models is smell-specific. Our transfer-learning experiments show that transfer-learning is definitely feasible for implementation smells with performance comparable to that of direct-learning. This work opens up a new paradigm to detect code smells by transfer-learning especially for the programming languages where the comprehensive code smell detection tools are not available. +	representation program analysis
2019	Using GGNN to recommend log statement level + + + + +	Mingzhe Li, Jianrui Pei, Jin He, Kevin Song, Frank Che, Yongfeng Huang, Chitai Wang		In software engineering, log statement is an important part because programmers can’t access to users’ program and they can only rely on log message to find the root of bugs. The mechanism of “log level” allows developers and users to specify the appropriate amount of logs to print during the execution of the software. And 26\% of the log statement modification is to modify the level. We tried to use ML method to predict the suitable level of log statement. The specific model is GGNN(gated graph neural network) and we have drawn lessons from Microsoft’s research. In this work, we apply Graph Neural Networks to predict the usage of log statement level of some open source java projects from github. Given the good performance of GGNN in this task, we are confident that GGNN is an excellent choice for processing source code. We envision this model can play an important role in applying AI/ML technique for Software Development Life Cycle more broadly. +	GNN logging
2019	DeepDelta: Learning to Repair Compilation Errors + + + + +	Ali Mesbah, Andrew Rice, Emily Johnston, Nick Glorioso, Edward Aftandilian.		Programmers spend a substantial amount of time manually repairing +code that does not compile. We observe that the repairs for +any particular error class typically follow a pattern and are highly +mechanical. We propose a novel approach that automatically learns +these patterns with a deep neural network and suggests program +repairs for the most costly classes of build-time compilation failures. +We describe how we collect all build errors and the human-authored, +in-progress code changes that cause those failing builds to transition +to successful builds at Google. We generate an AST diff from the +textual code changes and transform it into a domain-specific +language called Delta that encodes the change that must be made +to make the code compile. We then feed the compiler diagnostic +information (as source) and the Delta changes that resolved the +diagnostic (as target) into a Neural Machine Translation network for +training. For the two most prevalent and costly classes of Java compilation errors, +namely missing symbols and mismatched methodsignatures, our system called DeepDelta, +generates the correct repair changes for 19,314 out of 38,788 (50%) of unseen compilation +errors. The correct changes are in the top three suggested axes 86% of the time on average. +	repair edit compilation
2019	Commit2Vec: Learning Distributed Representations of Code Changes + + + + +	Adelina Ciurumelea; Sebastian Proksch; Harald C. Gall		Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories). + + In this work, we elaborate upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and we adapt it to represent source changes (i.e., commits). We use this representation to classify security-relevant commits. + + Because our method uses transfer learning (that is, we train a network on a “pretext task” for which abundant labeled data is available, and then we use such network for the target task of commit classification, for which fewer labeled instances are available), we studied the impact of pre-training the network using two different pretext tasks versus a randomly initialized model. + + Our results indicate that representations that leverage the structural information obtained through code syntax outperform token-based representations. Furthermore, the performance metrics obtained when pre-training on a loosely related pretext task with a very large dataset (>10e6 samples) were surpassed when pretraining on a smaller dataset (>10e4 samples) but for a pretext task that is more closely related to the target task. +	edit
2019	Testing Neural Program Analyzers + + + + +	Md Rafiqul Islam Rabin, Ke Wang, Mohammad Amin Alipour	ASE (LBR-Track)	Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some predictions about it, e.g., bug prediction. We call these models neural program analyzers. The reliability of neural programs can impact the reliability of the encompassing analyses. In this paper, we describe our ongoing efforts to develop effective techniques for testing neural programs. We discuss the challenges involved in developing such tools and our future plans. In our preliminary experiment on a neural model recently proposed in the literature, we found that the model is very brittle, and simple perturbations in the input can cause the model to make mistakes in its prediction. +	evaluation refactoring
2019	Scalable Taint Specification Inference with Big Code + + + + +	V. Chibotaru, B. Bichsel, Veselin Raychev, Martin Vechev	PLDI	We present a new scalable, semi-supervised method for inferring +taint analysis specifications by learning from a large dataset of programs. +Taint specifications capture the role of library APIs (source, sink, sanitizer) +and are a critical ingredient of any taint analyzer that aims to detect +security violations based on information flow. + + The core idea of our method +is to formulate the taint specification learning problem as a linear +optimization task over a large set of information flow constraints. +The resulting constraint system can then be efficiently solved with +state-of-the-art solvers. Thanks to its scalability, our method can infer +many new and interesting taint specifications by simultaneously learning from +a large dataset of programs (e.g., as found on GitHub), while requiring +few manual annotations. + + We implemented our method in an end-to-end system, +called Seldon, targeting Python, a language where static specification +inference is particularly hard due to lack of typing information. +We show that Seldon is practically effective: it learned almost 7,000 API +roles from over 210,000 candidate APIs with very little supervision +(less than 300 annotations) and with high estimated precision (67%). +Further,using the learned specifications, our taint analyzer flagged more than +20,000 violations in open source projects, 97% of which were +undetectable without the inferred specifications. +	defect program analysis
2019	Neural Program Repair by Jointly Learning to Localize and Repair + + + + +	Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, Rishabh Singh	ICLR	Due to its potential to improve programmer productivity and software quality, automated program repair has been an active topic of research. Newer techniques harness neural networks to learn directly from examples of buggy programs and their fixes. In this work, we consider a recently identified class of bugs called variable-misuse bugs. The state-of-the-art solution for variable misuse enumerates potential fixes for all possible bug locations in a program, before selecting the best prediction. We show that it is beneficial to train a model that jointly and directly localizes and repairs variable-misuse bugs. We present multi-headed pointer networks for this purpose, with one head each for localization and repair. The experimental results show that the joint model significantly outperforms an enumerative solution that uses a pointer based model for repair alone. +	repair program analysis variable misuse
2019	Multi-Modal Attention Network Learning for Semantic Source Code Retrieval + + + + +	Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, Philip S. Yu		Code retrieval techniques and tools have been playing a key role in facilitating software developers to retrieve existing code fragments from available open-source repositories given a user query. Despite the existing efforts in improving the effectiveness of code retrieval, there are still two main issues hindering them from being used to accurately retrieve satisfiable code fragments from large-scale repositories when answering complicated queries. First, the existing approaches only consider shallow features of source code such as method names and code tokens, but ignoring structured features such as abstract syntax trees (ASTs) and control-flow graphs (CFGs) of source code, which contains rich and well-defined semantics of source code. Second, although the deep learning-based approach performs well on the representation of source code, it lacks the explainability, making it hard to interpret the retrieval results and almost impossible to understand which features of source code contribute more to the final results. + + To tackle the two aforementioned issues, this paper proposes MMAN, a novel Multi-Modal Attention Network for semantic source code retrieval. A comprehensive multi-modal representation is developed for representing unstructured and structured features of source code, with one LSTM for the sequential tokens of code, a Tree-LSTM for the AST of code and a GGNN (Gated Graph Neural Network) for the CFG of code. Furthermore, a multi-modal attention fusion layer is applied to assign weights to different parts of each modality of source code and then integrate them into a single hybrid representation. Comprehensive experiments and analysis on a large-scale real-world dataset show that our proposed model can accurately retrieve code snippets and outperforms the state-of-the-art methods. +	search
2019	Learning Scalable and Precise Representation of Program Semantics + + + + +	Ke Wang		Neural program embedding has shown potential in aiding the analysis of large-scale, complicated software. Newly proposed deep neural architectures pride themselves on learning program semantics rather than superficial syntactic features. However, by considering the source code only, the vast majority of neural networks do not capture a deep, precise representation of program semantics. In this paper, we present \dypro, a novel deep neural network that learns from program execution traces. Compared to the prior dynamic models, not only is \dypro capable of generalizing across multiple executions for learning a program’s dynamic semantics in its entirety, but \dypro is also more efficient when dealing with programs yielding long execution traces. For evaluation, we task \dypro with semantic classification (i.e. categorizing programs based on their semantics) and compared it against two prominent static models: Gated Graph Neural Network and TreeLSTM. We find that \dypro achieves the highest prediction accuracy among all models. To further reveal the capacity of all aforementioned deep neural architectures, we examine if the models can learn to detect deeper semantic properties of a program. In particular given a task of recognizing loop invariants, we show \dypro beats all static models by a wide margin. +	representation dynamic
2019	SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair + + + + +	Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, Martin Monperrus		This paper presents a novel end-to-end approach to program repair based on sequence-to-sequence learning. We devise, implement, and evaluate a system, called SequenceR, for fixing bugs based on sequence-to-sequence learning on source code. This approach uses the copy mechanism to overcome the unlimited vocabulary problem that occurs with big code. Our system is data-driven; we train it on 35,578 commits, carefully curated from open-source repositories. We evaluate it on 4,711 independent real bug fixes, as well on the Defects4J benchmark used in program repair research. SequenceR is able to perfectly predict the fixed line for 950/4711 testing samples. It captures a wide range of repair operators without any domain-specific top-down design. +	repair code generation
2019	A Literature Study of Embeddings on Source Code + + + + +	Zimin Chen, Martin Monperrus		Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and discuss the usage of word embedding techniques on programs and source code. The articles in this survey have been collected by asking authors of related work and with an extensive search on Google Scholar. Each article is categorized into five categories: 1. embedding of tokens 2. embedding of functions or methods 3. embedding of sequences or sets of method calls 4. embedding of binary code 5. other embeddings. We also provide links to experimental data and show some remarkable visualization of code embeddings. In summary, word embedding has been successfully applied on different granularities of source code. With access to countless open-source repositories, we see a great potential of applying other data-driven natural language processing techniques on source code in the future. +	representation
2019	Mining Likely Analogical APIs across Third-Party Libraries via Large-Scale Unsupervised API Semantics Embedding + + + + +	Chunyang Chen, Zhenchang Xing, Yang Liu, Kent Ong Long Xiong	TSE	Establishing API mappings between third-party libraries is a prerequisite step for library migration tasks. Manually establishing API mappings is tedious due to the large number of APIs to be examined. Having an automatic technique to create a database of likely API mappings can significantly ease the task. Unfortunately, existing techniques either adopt supervised learning mechanism that requires already-ported or functionality similar applications across major programming languages or platforms, which are difficult to come by for an arbitrary pair of third-party libraries, or cannot deal with lexical gap in the API descriptions of different libraries. To overcome these limitations, we present an unsupervised deep learning based approach to embed both API usage semantics and API description (name and document) semantics into vector space for inferring likely analogical API mappings between libraries. Based on deep learning models trained using tens of millions of API call sequences, method names and comments of 2.8 millions of methods from 135,127 GitHub projects, our approach significantly outperforms other deep learning or traditional information retrieval (IR) methods for inferring likely analogical APIs. We implement a proof-of-concept website which can recommend analogical APIs for 583,501 APIs of 111 pairs of analogical Java libraries with diverse functionalities. This scale of third-party analogical-API database has never been achieved before. +	API representation
2019	Capturing source code semantics via tree-based convolution over API-enhanced AST + + + + +	Long Chen, Wei Ye, Shikun Zhang	Computing Frontiers	When deep learning meets big code, a key question is how to efficiently learn a distributed representation for source code that can capture its semantics effectively. We propose to use tree-based convolution over API-enhanced AST. To demonstrate the effectiveness of our approach, we apply it to detect semantic clones—code fragments with similar semantics but dissimilar syntax. Experiment results show that our approach outperforms an existing state-of-the-art approach that uses tree-based LSTM, with an increase of 0.39 and 0.12 in F1-score on OJClone and BigCloneBench respectively. We further propose architectures that incorporate our approach for code search and code summarization. +	grammar representation
2019	When Deep Learning Met Code Search + + + + +	Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, Satish Chandra		There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language queries, into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including unsupervised techniques, which rely only on a corpus of code examples, and supervised techniques, which use an aligned corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet. + + Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform, training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a minimal supervision extension to an existing unsupervised technique. + + Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective that more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus. +	search
2019	SAR: Learning Cross-Language API Mappings with Little Knowledge + + + + +	N. D. Q. Bui, Y. Yu, L. Jiang	FSE	To save manual effort, developers often translate programs from one programming language to another, instead of implementing it from scratch. Translating application program interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying the API mappings across programming languages. However, all these approaches still require large amount of manual effort in preparing parallel program corpora, ranging from pairs of APIs, to manually identified code in different languages that are considered as functionally equivalent. To minimize the manual effort in identifying parallel program corpora and API mappings, this paper aims at an automated approach to map APIs across languages with much less knowledge a priori needed than other existing approaches. The approach is based on an realization of the notion of domain adaption combined with code embedding, which can better align two vector spaces: taking as input large sets of programs, our approach first generates numeric vector representations of the programs, especially the APIs used in each language, and it adapts generative adversarial networks (GAN) to align the vectors from the spaces of two languages. For a better alignment, we initialize the GAN with parameters derived from optional API mapping seeds that can be identified accurately with a simple automatic signature-based matching heuristic. Then the cross-language API mappings can be identified via nearest-neighbors queries in the aligned vector spaces. +	representation API
2019	STYLE-ANALYZER: fixing code style inconsistencies with interpretable unsupervised algorithms + + + + +	Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, Egor Bulychev	MSR	Source code reviews are manual, time-consuming, and expensive. Human involvement should be focused on analyzing the most relevant aspects of the program, such as logic and maintainability, rather than amending style, syntax, or formatting defects. Some tools with linting capabilities can format code automatically and report various stylistic violations for supported programming languages. They are based on rules written by domain experts, hence, their configuration is often tedious, and it is impractical for the given set of rules to cover all possible corner cases. Some machine learning-based solutions exist, but they remain uninterpretable black boxes. This paper introduces STYLE-ANALYZER, a new open source tool to automatically fix code formatting violations using the decision tree forest model which adapts to each codebase and is fully unsupervised. STYLE-ANALYZER is built on top of our novel assisted code review framework, Lookout. It accurately mines the formatting style of each analyzed Git repository and expresses the found format patterns with compact human-readable rules. STYLE-ANALYZER can then suggest style inconsistency fixes in the form of code review comments. We evaluate the output quality and practical relevance of STYLE-ANALYZER by demonstrating that it can reproduce the original style with high precision, measured on 19 popular JavaScript projects, and by showing that it yields promising results in fixing real style mistakes. STYLE-ANALYZER includes a web application to visualize how the rules are triggered. We release STYLE-ANALYZER as a reusable and extendable open source software package on GitHub for the benefit of the community. +	style
2019	Generative Code Modeling with Graphs + + + + +	Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, Oleksandr Polozov	ICLR	Generative models forsource code are an interesting structured prediction problem, requiring to reason about both hard syntactic and semantic constraints as well as about natural, likely programs. We present a novel model for this problem that uses a graph to represent the intermediate state of the generated output. Our model generates code by interleaving grammar-driven expansion steps with graph augmentation and neural message passing steps. An experimental evaluation shows that our new model can generate semantically meaningful expressions, outperforming a range of strong baselines. +	grammar code generation GNN
2019	Learning to Fuzz from Symbolic Execution with Application to Smart Contracts + + + + +	Jingxuan He, Mislav Balunović, Nodar Ambroladze, Petar Tsankov, Martin Vechev	CCS	Fuzzing and symbolic execution are two complementary techniques for discovering software vulnerabilities. Fuzzing is fast and scalable, but can be ineffective when it fails to randomly select the right inputs. Symbolic execution is thorough but slow and often does not scale to deep program paths with complex path conditions. In this work, we propose to learn an effective and fast fuzzer from symbolic execution, by phrasing the learning task in the framework of imitation learning. During learning, a symbolic execution expert generates a large number of quality inputs improving coverage on thousands of programs. Then, a fuzzing policy, represented with a suitable architecture of neural networks, is trained on the generated dataset. The learned policy can then be used to fuzz new programs. We instantiate our approach to the problem of fuzzing smart contracts, a domain where contracts often implement similar functionality (facilitating learning) and security is of utmost importance. We present an end-to-end system, ILF (for Imitation Learning based Fuzzer), and an extensive evaluation over >18K contracts. Our results show that ILF is effective: (i) it is fast, generating 148 transactions per second, (ii) it outperforms existing fuzzers (e.g., achieving 33% more coverage), and (iii) it detects more vulnerabilities than existing fuzzing and symbolic execution tools for Ethereum. +	fuzzing GNN
2019	Graph-based Mining of In-the-Wild, Fine-grained, Semantic Code Change Patterns + + + + +	Hoan Anh Nguyen, Tien N. Nguyen, Danny Dig, Son Nguyen, Hieu Tran, and Michael Hilton	ICSE	Existing approaches for detecting repetitive code changes relying on syntactic similarity cannot effectively detect semantic change patterns. In this work, we introduce a novel graph-based mining approach, CPatMiner, which is capable of detecting semantic code change patterns from a large number of open-source repositories by capturing dependencies between fine-grained change elements. We evaluated CPatMiner by mining change patterns in a diverse corpus of 5,000+ open-source projects from GitHub with 170,000+ developers. We use three complementary methods. First, we sent the mined patterns to the authors and received 108 responses. 70% of respondents recognized those patterns as their meaningful frequent changes. 79% of respondents even named the patterns, and 44% wanted IDEs to automate such repetitive changes. The mined patterns belong to various activities: adaptive (9%), perfective (20%), corrective (35%) and preventive (36%). Second, we compared CPatMiner with the state-of-the-art, AST-based technique, and reported that CPatMiner detects 2.1x more meaningful patterns. Third, we used CPatMiner to search for patterns in a corpus of 88 GitHub projects with longer histories consisting of 164M SLOCs. It constructed 322K fine-grained change graphs containing 3M nodes, and detected 17K change patterns which provide unique insights on the practice of change patterns among individuals and teams. We found that a large percentage (75%) of the patterns from individual developers are commonly shared with others, and this holds true for teams. Moreover, we found that the patterns spread widely over time. Thus, we call for a community-based change pattern database to provide important resources in novel applications. +	edit pattern mining
2019	Pythia: AI-assisted Code Completion System + + + + +	Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, Neel Sundaresan	KDD	In this paper, we propose a novel end-to-end approach for AI-assisted code completion called Pythia. It generates ranked lists of method and API recommendations which can be used by software developers at edit time. The system is currently deployed as part of Intellicode extension in Visual Studio Code IDE. Pythia exploits state-of-the-art large-scale deep learning models trained on code contexts extracted from abstract syntax trees. It is designed to work at a high throughput predicting the best matching code completions on the order of 100 ms. + + We describe the architecture of the system, perform comparisons to frequency-based approach and invocation-based Markov Chain language model, and discuss challenges serving Pythia models on lightweight client devices. + + The offline evaluation results obtained on 2700 Python open source software GitHub repositories show a top-5 accuracy of 92%, surpassing the baseline models by 20% averaged over classes, for both intra and cross-project settings. + +	autocomplete language model
2019	AutoPandas: neural-backed generators for program synthesis + + + + +	Rohan Bavishi, Caroline Lemieux, Roy Fox, Koushik Sen, Ion Stoica	OOPSLA	Developers nowadays have to contend with a growing number of APIs. While in the long-term they are very useful to developers, many modern APIs have an incredibly steep learning curve, due to their hundreds of functions handling many arguments, obscure documentation, and frequently changing semantics. For APIs that perform data transformations, novices can often provide an I/O example demonstrating the desired transformation, but may be stuck on how to translate it to the API. A programming-by-example synthesis engine that takes such I/O examples and directly produces programs in the target API could help such novices. Such an engine presents unique challenges due to the breadth of real-world APIs, and the often-complex constraints over function arguments. We present a generator-based synthesis approach to contend with these problems. This approach uses a program candidate generator, which encodes basic constraints on the space of programs. We introduce neural-backed operators which can be seamlessly integrated into the program generator. To improve the efficiency of the search, we simply use these operators at non-deterministic decision points, instead of relying on domain-specific heuristics. We implement this technique for the Python pandas library in AutoPandas. AutoPandas supports 119 pandas dataframe transformation functions. We evaluate AutoPandas on 26 real-world benchmarks and find it solves 17 of them. +	synthesis GNN API
2019	Method name suggestion with hierarchical attention networks + + + + +	Sihan Xu, Sen Zhang, Weijing Wang, Xinya Cao, Chenkai Guo, Jing Xu.	PEPM	Method Rename has been a widely used refactoring operation that improves program comprehension and maintenance. Descriptive method names that summarize functionalities of source code can facilitate program comprehension. Much research has been done to suggest method names through source code summarization. However, unlike natural language, a code snippet consists of basic blocks organized by complicated structures. In this work, we observe a hierarchical structure — tokens form basic blocks and basic blocks form a code snippet. Based on this observation, we exploit a hierarchical attention network to learn the representation of methods. Specifically, we apply two-level attention mechanism to learn the importance of each token in a basic block and that of a basic block in a method respectively. We evaluated our approach on 10 open source repositories and compared it against three state-of-the-art approaches. The results on these open-source data show the superiority of our hierarchical attention networks in terms of effectiveness. +	naming
2019	Code Mapping in Heterogeneous Platforms Using Deep Learning and LLVM-IR + + + + +	Francesco Barchi, Gianvito Urgese, Enrico Macii, Andrea Acquaviva	DAC	Modern heterogeneous platforms require compilers capable of choosing the appropriate device for the execution of program portions. This paper presents a machine learning method designed for supporting mapping decisions through the analysis of the program source code represented in LLVM assembly language (IR) for exploiting the advantages offered by this generalised and optimised representation. To evaluate our solution, we trained an LSTM neural network on OpenCL kernels compiled in LLVM-IR and processed with our tokenizer capable of filtering less-informative tokens. We tested the network that reaches an accuracy of 85% in distinguishing the best computational unit. +	optimization program analysis static analysis natural language processing
2019	Automatic Source Code Summarization with Extended Tree-LSTM + + + + +	Yusuke Shido, Yasuaki Kobayashi, Akihiro Yamamoto, Atsushi Miyamoto, Tadayuki Matsumura	International Joint Conference on Neural Networks	Neural machine translation models are used to automatically generate a document from given source code since this can be regarded as a machine translation task. Source code summarization is one of the components for automatic document generation, which generates a summary in natural language from given source code. This suggests that techniques used in neural machine translation, such as Long Short-Term Memory (LSTM), can be used for source code summarization. However, there is a considerable difference between source code and natural language: Source code is essentially structured, having loops and conditional branching, etc. Therefore, there is some obstacle to apply known machine translation models to source code.Abstract syntax trees (ASTs) capture these structural properties and play an important role in recent machine learning studies on source code. Tree-LSTM is proposed as a generalization of LSTMs for tree-structured data. However, there is a critical issue when applying it to ASTs: It cannot handle a tree that contains nodes having an arbitrary number of children and their order simultaneously, which ASTs generally have such nodes. To address this issue, we propose an extension of Tree-LSTM, which we call Multi-way Tree-LSTM and apply it for source code summarization. As a result of computational experiments, our proposal achieved better results when compared with several state-of-the-art techniques. +	summarization grammar
2019	code2vec: Learning Distributed Representations of Code + + + + +	Uri Alon, Omer Levy, Eran Yahav	POPL	We present a neural model for representing snippets of code as continuous distributed vectors (“code embeddings”). + The main idea is to represent a code snippet as a single fixed-length +code vector, which can be used to +predict semantic properties of the snippet. To this end, code is first decomposed to a collection of paths in its +abstract syntax tree. Then, the network learns the atomic representation of each path while +simultaneously +learning how to aggregate a set of them. + + We demonstrate the effectiveness of our approach by using it to predict a method’s name from the vector +representation of its body. We evaluate our approach by training a model on a dataset of 12M methods. We +show that code vectors trained on this dataset can predict method names from files that were unobserved +during training. Furthermore, we show that our model learns useful method name vectors that capture +semantic similarities, combinations, and analogies. + + A comparison of our approach to previous techniques over the same dataset shows an improvement of +more than 75%, making it the first to successfully predict method names based on a large, cross-project +corpus. Our trained model, visualizations and vector similarities are available as an interactive online demo at +http://code2vec.org. The code, data and trained models are available at +https://github.com/tech-srl/code2vec. +	naming summarization representation
2019	Structural Language Models for Any-Code Generation + + + + +	Uri Alon, Roy Sadaka, Omer Levy, Eran Yahav		We address the problem of Any-Code Generation (AnyGen) - generating code without any restriction on the vocabulary or structure. The state-of-the-art in this problem is the sequence-to-sequence (seq2seq) approach, which treats code as a sequence and does not leverage any structural information. We introduce a new approach to AnyGen that leverages the strict syntax of programming languages to model a code snippet as a tree - structural language modeling (SLM). SLM estimates the probability of the program’s abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous structural techniques that have severely restricted the kinds of expressions that can be generated, our approach can generate arbitrary expressions in any programming language. Our model significantly outperforms both seq2seq and a variety of existing structured approaches in generating Java and C# code. We make our code, datasets, and models available online. +	code generation
2019	Neural Networks for Modeling Source Code Edits + + + + +	Rui Zhao, David Bieber, Kevin Swersky, Daniel Tarlow		Programming languages are emerging as a challenging and interesting domain for machine learning. A core task, which has received significant attention in recent years, is building generative models of source code. However, to our knowledge, previous generative models have always been framed in terms of generating static snapshots of code. In this work, we instead treat source code as a dynamic object and tackle the problem of modeling the edits that software developers make to source code files. This requires extracting intent from previous edits and leveraging it to generate subsequent edits. We develop several neural networks and use synthetic data to test their ability to learn challenging edit patterns that require strong generalization. We then collect and train our models on a large-scale dataset of Google source code, consisting of millions of fine-grained edits from thousands of Python developers. From the modeling perspective, our main conclusion is that a new composition of attentional and pointer network components provides the best overall performance and scalability. From the application perspective, our results provide preliminary evidence of the feasibility of developing tools that learn to predict future edits. +	edit
2019	The Adverse Effects of Code Duplication in Machine Learning Models of Code + + + + +	Miltiadis Allamanis		The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has been recently identified by Lopes et al. (2017) who found a large amount of code duplication on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this article, we study the effect of code duplication to machine learning models showing that reported metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machine learning models of code are used by software engineers. We present an “errata” for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them, and release tools to help the community avoid this problem in future research. +	dataset evaluation
2019	Code Generation as a Dual Task of Code Summarization + + + + +	Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, Zhi Jin	NeurIPS	Code summarization (CS) and code generation (CG) are two crucial tasks in the field of automatic software development. Various neural network-based approaches are proposed to solve these two tasks separately. However, there exists a specific intuitive correlation between CS and CG, which have not been exploited in previous work. In this paper, we apply the relations between two tasks to improve the performance of both tasks. In other words, exploiting the duality between the two tasks, we propose a dual training framework to train the two tasks simultaneously. In this framework, we consider the dualities on probability and attention weights, and design corresponding regularization terms to constrain the duality. We evaluate our approach on two datasets collected from GitHub, and experimental results show that our dual framework can improve the performance of CS and CG tasks over baselines. +	code generation summarization
2019	Commit Message Generation for Source Code Changes + + + + +	Shengbin Xu, Yuan Yao, Feng Xu, Tianxiao Gu, Hanghang Tong, Jian Lu	IJCAI	Commit messages, which summarize the source +code changes in natural language, are essential for +program comprehension and software evolution understanding. Unfortunately, due to the lack of direct +motivation, commit messages are sometimes neglected by developers, making it necessary to +automatically generate such messages. State-of-the-art adopts learning based approaches such as +neural machine translation models for the commitmessage generation problem. However, they tend +to ignore the code structure information and suffer from the out-of-vocabulary issue. +In this paper, we propose CODISUM to address the above two limitations. In particular, +we first extract both code structure and code semantics from the source code changes, and then +jointly model these two sources of information so as to better learn the representations + of the code changes. Moreover, we augment the model with copying mechanism to further +mitigate the out-of-vocabulary issue. Experimental evaluations on real data demonstrate that +the proposed approach significantly outperforms the state-of-the-art in terms of accurately generating the commit messages. +	edit summarization
2019	Learning Lenient Parsing & Typing via Indirect Supervision + + + + +	Toufique Ahmed, Vincent Hellendoorn, Premkumar Devanbu		Both professional coders and teachers frequently deal with imperfect (fragmentary, incomplete, ill-formed) code. Such fragments are common in StackOverflow; students also frequently produce ill-formed code, for which instructors, TAs (or students themselves) must find repairs. In either case, the developer experience could be greatly improved if such code could somehow be parsed & typed; this makes them more amenable to use within IDEs and allows early detection and repair of potential errors. We introduce a lenient parser, which can parse & type fragments, even ones with simple errors. Training a machine learner to leniently parse & type imperfect code requires a large training set of pairs of imperfect code and its repair (and/or type information); such training sets are limited by human effort and curation. In this paper, we present a novel indirectly supervised approach to train a lenient parser, without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on Github, and the massive, efficient learning capacity of Transformer-based NN architectures. Using GitHub data, we first create a large dataset of fragments of code and corresponding tree fragments and type annotations; we then randomly corrupt the input fragments (while requiring correct output) by seeding errors that mimic corruptions found in StackOverflow and student data. Using this data, we train high-capacity transformer models to overcome both fragmentation and corruption. With this novel approach, we can achieve reasonable performance on parsing & typing StackOverflow fragments; we also demonstrate that our approach achieves best-in-class performance on a large dataset of student errors. +	types
2019	code2seq: Generating Sequences from Structured Representations of Code + + + + +	Uri Alon, Omer Levy, Eran Yahav	ICLR	The ability to generate natural language sequences from source code snippets has a variety of applications such as code summarization, documentation, and retrieval. Sequence-to-sequence (seq2seq) models, adopted from neural machine translation (NMT), have achieved state-of-the-art performance on these tasks by treating source code as a sequence of tokens. We present code2seq: an alternative approach that leverages the syntactic structure of programming languages to better encode source code. Our model represents a code snippet as the set of compositional paths in its abstract syntax tree (AST) and uses attention to select the relevant paths while decoding. + + We demonstrate the effectiveness of our approach for two tasks, two programming languages, and four datasets of up to 16M examples. Our model significantly outperforms previous models that were specifically designed for programming languages, as well as general state-of-the-art NMT models. An interactive online demo of our model is available at http://code2seq.org. +	naming summarization representation
2019	Learning-based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection + + + + +	Lutz Büch, Artur Andrzejak	SANER	Code clone detection remains a crucial challenge in maintaining software projects. Many classic approaches rely on handcrafted aggregation schemes, while recent work uses supervised or unsupervised learning. In this work, we study several aspects of aggregation schemes for code clone detection based on supervised learning. To this aim, we implement an AST-based Recursive Neural Network. Firstly, our ablation study shows the influence of model choices and hyperparameters. We introduce error scaling as a way to effectively and efficiently address the class imbalance problem arising in code clone detection. Secondly, we study the influence of pretrained embeddings representing nodes in ASTs. We show that simply averaging all node vectors of a given AST yields strong baseline aggregation scheme. Further, learned AST aggregation schemes greatly benefit from pretrained node embeddings. Finally, we show the importance of carefully separating training and test data by clone clusters, to reliably measure generalization of models learned with supervision. +	grammar grammar clone
2019	JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation + + + + +	Rajas Agashe, Srinivasan Iyer, Luke Zettlemoyer		Interactive programming with interleaved code snippet cells and natural language markdown is recently gaining popularity in the form of Jupyter notebooks, which accelerate prototyping and collaboration. To study code generation conditioned on a long context history, we present JuICe, a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments. Compared with existing contextual code generation datasets, JuICe provides refined human-curated data, open-domain code, and an order of magnitude more training data. Using JuICe, we train models for two tasks: (1) generation of the API call sequence in a code cell, and (2) full code cell generation, both conditioned on the NL-Code history up to a particular code cell. Experiments using current baseline code generation models show that both context and distant supervision aid in generation, and that the dataset is challenging for current systems. +	dataset bimodal
2019	Natural Software Revisited + + + + +	Musfiqur Rahman, Dharani Palani, Peter C. Rigby	ICSE	Recent works have concluded that software is more repetitive and predictable, i.e. more natural, than English texts. These works included “simple/artificial” syntax rules in their language models. When we remove SyntaxTokens we find that code is still repetitive and predictable but only at levels slightly above English. Furthermore, previous works have compared individual Java programs to general English corpora, such as Gutenberg, which contains a historically large range of styles and subjects (e.g. Saint Augustine to Oscar Wilde). We perform an additional comparison of technical StackOverflow English discussions with source code and find that this restricted English is similarly repetitive to code. Although we find that code is less repetitive than previously thought, we suspect that API code element usage will be repetitive across software projects. For example a file is opened and closed in the same manner irrespective of domain. When we restrict our n-grams to those contained in the Java API we find that the entropy is significantly lower than the English corpora. Previous works have focused on sequential sequences of tokens. When we extract program graphs of size 2, 3, and 4 nodes we see that the abstract graph representation is much more concise and repetitive than the sequential representations of the same code. This suggests that future work should focus on statistical graph models that go beyond linear sequences of tokens. Our anonymous replication package makes our scripts and data available to future researchers and reviewers. +
2019	A Neural Model for Generating Natural Language Summaries of Program Subroutines + + + + +	Alexander LeClair, Siyuan Jiang, Collin McMillan	ICSE	Source code summarization – creating natural language descriptions of source code behavior – is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance. Traditional techniques relied on heuristics and templates built manually by human experts. Recently, data-driven approaches based on neural machine translation have largely overtaken template-based systems. But nearly all of these techniques rely almost entirely on programs having good internal documentation; without clear identifier names, the models fail to create good summaries. In this paper, we present a neural model that combines words from code with code structure from an AST. Unlike previous approaches, our model processes each data source as a separate input, which allows the model to learn code structure independent of the text in code. This process helps our approach provide coherent summaries in many cases even when zero internal documentation is provided. We evaluate our technique with a dataset we created from 2.1m Java methods. We find improvement over two baseline techniques from SE literature and one from NLP literature. +	summarization documentation
2019	Neural Code Search Evaluation Dataset + + + + +	Hongyu Li, Seohyun Kim, Satish Chandra		There has been an increase of interest in code search using natural language. Assessing the performance of such code search models can be difficult without a readily available evaluation suite. In this paper, we present an evaluation dataset consisting of natural language query and code snippet pairs, with the hope that future work in this area can use this dataset as a common benchmark. We also provide the results of two code search models ([1] and [6]) from recent work. +	dataset search
2019	CORE: Automating Review Recommendation for Code Changes + + + + +	JingKai Siow, Cuiyun Gao, Lingling Fan, Sen Chen, Yang Liu	SANER	Code review is a common process that is used by developers, in which a reviewer provides useful comments or points out defects in the submitted source code changes via pull request. Code review has been widely used for both industry and open-source projects due to its capacity in early defect identification, project maintenance, and code improvement. With rapid updates on project developments, code review becomes a non-trivial and labor-intensive task for reviewers. Thus, an automated code review engine can be beneficial and useful for project development in practice. Although there exist prior studies on automating the code review process by adopting static analysis tools or deep learning techniques, they often require external sources such as partial or full source code for accurate review suggestion. In this paper, we aim at automating the code review process only based on code changes and the corresponding reviews but with better performance. The hinge of accurate code review suggestion is to learn good representations for both code changes and reviews. To achieve this with limited source, we design a multi-level embedding (i.e., word embedding and character embedding) approach to represent the semantics provided by code changes and reviews. The embeddings are then well trained through a proposed attentional deep learning model, as a whole named CORE. We evaluate the effectiveness of CORE on code changes and reviews collected from 19 popular Java projects hosted on Github. Experimental results show that our model CORE can achieve significantly better performance than the state-of-the-art model (DeepMem), with an increase of 131.03% in terms of Recall@10 and 150.69% in terms of Mean Reciprocal Rank. Qualitative general word analysis among project developers also demonstrates the performance of CORE in automating code review. +	review
2019	Automatic Acquisition of Annotated Training Corpora for Test-Code Generation + + + + +	Magdalena Kacmajor, John D. Kelleher.	Information	Open software repositories make large amounts of source code publicly available. Potentially, this source code could be used as training data to develop new, machine learning-based programming tools. For many applications, however, raw code scraped from online repositories does not constitute an adequate training dataset. Building on the recent and rapid improvements in machine translation (MT), one possibly very interesting application is code generation from natural language descriptions. One of the bottlenecks in developing these MT-inspired systems is the acquisition of parallel text-code corpora required for training code-generative models. This paper addresses the problem of automatically synthetizing parallel text-code corpora in the software testing domain. Our approach is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular for unit testing. Therefore, we propose synthesizing parallel corpora comprised of parsed test function names serving as code descriptions, aligned with the corresponding function bodies. We present the results of applying one of the state-of-the-art MT methods on such a generated dataset. Our experiments show that a neural MT model trained on our dataset can generate syntactically correct and semantically relevant short Java functions from quasi-natural language descriptions of functionality. +
2019	Maybe Deep Neural Networks are the Best Choice for Modeling Source Code + + + + +	Rafael-Michael Karampatsis, Charles Sutton		Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. But the open vocabulary version of the neural network language models for code have not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best in class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported. +	language model
2019	Towards Neural Decompilation + + + + +	Omer Katz, Yuval Olshaker, Yoav Goldberg, Eran Yahav		We address the problem of automatic decompilation, converting a program in low-level representation back to a higher-level human-readable programming language. The problem of decompilation is extremely important for security researchers. Finding vulnerabilities and understanding how malware operates is much easier when done over source code. + + The importance of decompilation has motivated the construction of hand-crafted rule-based decompilers. Such decompilers have been designed by experts to detect specific control-flow structures and idioms in low-level code and lift them to source level. The cost of supporting additional languages or new language features in these models is very high. + + We present a novel approach to decompilation based on neural machine translation. The main idea is to automatically learn a decompiler from a given compiler. Given a compiler from a source language S to a target language T , our approach automatically trains a decompiler that can translate (decompile) T back to S . We used our framework to decompile both LLVM IR and x86 assembly to C code with high success rates. Using our LLVM and x86 instantiations, we were able to successfully decompile over 97% and 88% of our benchmarks respectively. +	decompilation
2019	On the Impact of Refactoring Operations on Code Naturalness + + + + +	Bin Lin, Csaba Nagy, Gabriele Bavota, Michele Lanza	SANER	Recent studies have demonstrated that software is natural, that is, its source code is highly repetitive and predictable like human languages. Also, previous studies suggested the existence of a relationship between code quality and its naturalness, presenting empirical evidence showing that buggy code is “less natural” than non-buggy code. We conjecture that this qualitynaturalness relationship could be exploited to support refactoring activities (e.g., to locate source code areas in need of refactoring). We perform a first step in this direction by analyzing whether refactoring can improve the naturalness of code. We use state-of-the-art tools to mine a large dataset of refactoring operations performed in open source systems. Then, we investigate the impact of different types of refactoring operations on the naturalness of the impacted code. We found that (i) code refactoring does not necessarily increase the naturalness of the refactored code; and (ii) the impact on the code naturalness strongly depends on the type of refactoring operations. +	language model refactoring
2019	TreeCaps: Tree-Structured Capsule Networks for Program Source Code Processing + + + + +	Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer		Program comprehension is a fundamental task in software development and maintenance processes. Software developers often need to understand a large amount of existing code before they can develop new features or fix bugs in existing programs. Being able to process programming language code automatically and provide summaries of code functionality accurately can significantly help developers to reduce time spent in code navigation and understanding, and thus increase productivity. Different from natural language articles, source code in programming languages often follows rigid syntactical structures and there can exist dependencies among code elements that are located far away from each other through complex control flows and data flows. Existing studies on tree-based convolutional neural networks (TBCNN) and gated graph neural networks (GGNN) are not able to capture essential semantic dependencies among code elements accurately. In this paper, we propose novel tree-based capsule networks (TreeCaps) and relevant techniques for processing program code in an automated way that encodes code syntactical structures and captures code dependencies more accurately. Based on evaluation on programs written in different programming languages, we show that our TreeCaps-based approach can outperform other approaches in classifying the functionalities of many programs. +	representation
2019	Neural query expansion for code search + + + + +	Jason Liu, Seohyun Kim, Vijayaraghavan Murali, Swarat Chaudhuri, Satish Chandra	MAPL	Searching repositories of existing source code for code snippets is a key task in software engineering. Over the years, many approaches to this problem have been proposed. One recent tool called NCS, takes in a natural language query and outputs relevant code snippets, often being able to correctly answer Stack Overflow questions. But what happens when the developer doesn’t provide a query with a clear intent? What if shorter queries are used to demonstrate a more vague intent? + + We find that the performance of NCS regresses with shorter queries. Furthermore, data from developers’ code search history logs shows that shorter queries have a less successful code search session: there are more query reformulations and more time is spent browsing the results. These observations lead us to believe that using NCS alone with short queries may not be productive enough. + + In this paper, we explore an additional way of using neural networks in code search: the automatic expansion of queries. We present NQE, a neural model that takes in a set of keywords and predicts a set of keywords to expand the query to NCS. NQE learns to predict keywords that co-occur with the query keywords in the underlying corpus, which helps expand the query in a productive way. Our results show that with query expansion, NQE + NCS is able to perform better than using NCS alone. +	search
2019	Learning Programmatic Idioms for Scalable Semantic Parsing + + + + +	Srinivasan Iyer, Alvin Cheung, Luke Zettlemoyer		Programmers typically organize executable source code using high-level coding patterns or idiomatic structures such as nested loops, exception handlers and recursive blocks, rather than as individual code tokens. In contrast, state of the art semantic parsers still map natural language instructions to source code by building the code syntax tree one node at a time. In this paper, we introduce an iterative method to extract code idioms from large source code corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax trees, and we train semantic parsers to apply these idioms during decoding. We apply this idiom-based code generation to a recent context-dependent semantic parsing task, and improve the state of the art by 2.2% BLEU score while reducing training time by more than 50%. This improved speed enables us to scale up the model by training on an extended training set that is 5x times larger, to further move up the state of the art by an additional 2.3% BLEU and 0.9% exact match. +	pattern mining code generation grammar
2019	Deep Transfer Learning for Source Code Modeling + + + + +	Yasir Hussain, Zhiqiu Huang, Yu Zhou, Senzhang Wang		In recent years, deep learning models have shown great potential in source code modeling and analysis. Generally, deep learning-based approaches are problem-specific and data-hungry. A challenging issue of these approaches is that they require training from starch for a different related problem. In this work, we propose a transfer learning-based approach that significantly improves the performance of deep learning-based source code models. In contrast to traditional learning paradigms, transfer learning can transfer the knowledge learned in solving one problem into another related problem. First, we present two recurrent neural network-based models RNN and GRU for the purpose of transfer learning in the domain of source code modeling. Next, via transfer learning, these pre-trained (RNN and GRU) models are used as feature extractors. Then, these extracted features are combined into attention learner for different downstream tasks. The attention learner leverages from the learned knowledge of pre-trained models and fine-tunes them for a specific downstream task. We evaluate the performance of the proposed approach with extensive experiments with the source code suggestion task. The results indicate that the proposed approach outperforms the state-of-the-art models in terms of accuracy, precision, recall, and F-measure without training the models from scratch. +	pretraining
2019	On Learning Meaningful Code Changes via Neural Machine Translation + + + + +	Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk	ICSE	Recent years have seen the rise of Deep Learning (DL) techniques applied to source code. Researchers have exploited DL to automate several development and maintenance tasks, such as writing commit messages, generating comments and detecting vulnerabilities among others. One of the long lasting dreams of applying DL to code is the possibility to automate non-trivial coding activities. While some steps in this direction have been taken (e.g., learning how to fix bugs), there is still a lack of empirical evidence on the types of code changes that can be learned and automatically applied by DL. Our goal is to make this first step by quantitatively and qualitatively investigating the ability of a Neural Machine Translation (NMT) model to learn how to automatically apply code changes implemented by developers during pull requests. We train and experiment with the NMT model on a set of 236k pairs of code components before and after the implementation of the changes provided in the pull requests. We show that, when applied in a narrow enough context (i.e., small/medium-sized pairs of methods before/after the pull request changes), NMT can automatically replicate the changes implemented by developers during pull requests in up to 36% of the cases. Moreover, our qualitative analysis shows that the model is capable of learning and replicating a wide variety of meaningful code changes, especially refactorings and bug-fixing activities. Our results pave the way to novel research in the area of DL on code, such as the automatic learning and applications of refactoring. +	repair edit
2019	Program Classification Using Gated Graph Attention Neural Network for Online Programming Service + + + + +	Mingming Lu, Dingwu Tan, Naixue Xiong, Zailiang Chen, Haifeng Li		The online programing services, such as Github, TopCoder, and EduCoder, have promoted a lot of social interactions among the service users. However, the existing social interactions is rather limited and inefficient due to the rapid increasing of source-code repositories, which is difficult to explore manually. The emergence of source-code mining provides a promising way to analyze those source codes, so that those source codes can be relatively easy to understand and share among those service users. Among all the source-code mining attempts,program classification lays a foundation for various tasks related to source-code understanding, because it is impossible for a machine to understand a computer program if it cannot classify the program correctly. Although numerous machine learning models, such as the Natural Language Processing (NLP) based models and the Abstract Syntax Tree (AST) based models, have been proposed to classify computer programs based on their corresponding source codes, the existing works cannot fully characterize the source codes from the perspective of both the syntax and semantic information. To address this problem, we proposed a Graph Neural Network (GNN) based model, which integrates data flow and function call information to the AST,and applies an improved GNN model to the integrated graph, so as to achieve the state-of-art program classification accuracy. The experiment results have shown that the proposed work can classify programs with accuracy over 97%. +	GNN representation
2019	Mercem: Method Name Recommendation Based on Call Graph Embedding + + + + +	Hiroshi Yonai, Yasuhiro Hayase, Hiroyuki Kitagawa		Comprehensibility of source code is strongly affected by identifier names, therefore software developers need to give good (e.g. meaningful but short) names to identifiers. On the other hand, giving a good name is sometimes a difficult and time-consuming task even for experienced developers. To support naming identifiers, several techniques for recommending identifier name candidates have been proposed. These techniques, however, still have challenges on the goodness of suggested candidates and limitations on applicable situations. This paper proposes a new approach to recommending method names by applying graph embedding techniques to the method call graph. The evaluation experiment confirms that the proposed technique can suggest more appropriate method name candidates in difficult situations than the state of the art approach. +	naming representation refactoring
2019	CodeSearchNet Challenge: Evaluating the State of Semantic Code Search + + + + +	Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, Marc Brockschmidt		Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. + + To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. + + We hope that CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further and will host a competition and leaderboard to track the progress on the challenge. We are also keen on extending CodeSearchNet Challenge to more queries and programming languages in the future. +	dataset search
2019	Inferring Javascript types using Graph Neural Networks + + + + +	Jessica Schrouff, Kai Wohlfahrt, Bruno Marnette, Liam Atkinson	Representation Learning on Graphs and Manifolds ICLR 2019 workshop	The recent use of `Big Code’ with state-of-the-art deep learning methods offers promising avenues to ease program source code writing and correction. As a first step towards automatic code repair, we implemented a graph neural network model that predicts token types for Javascript programs. The predictions achieve an accuracy above 90%, which improves on previous similar work. +	GNN types program analysis
2019	Improving Bug Detection via Context-Based Code Representation Learning and Attention-Based Neural Networks + + + + +	Yi Li, Shaohua Wang, Tien N. Nguyen, Son Van Nguyen	OOPSLA	Bug detection has been shown to be an effective way to help developers in detecting bugs early, thus, saving much effort and time in software development process. Recently, deep learning-based bug detection approaches have gained successes over the traditional machine learning-based approaches, the rule-based program analysis approaches, and mining-based approaches. However, they are still limited in detecting bugs that involve multiple methods and suffer high rate of false positives. In this paper, we propose a combination approach with the use of contexts and attention neural network to overcome those limitations. We propose to use as the global context the Program Dependence Graph (PDG) and Data Flow Graph (DFG) to connect the method under investigation with the other relevant methods that might contribute to the buggy code. The global context is complemented by the local context extracted from the path on the AST built from the method’s body. The use of PDG and DFG enables our model to reduce the false positive rate, while to complement for the potential reduction in recall, we make use of the attention neural network mechanism to put more weights on the buggy paths in the source code. That is, the paths that are similar to the buggy paths will be ranked higher, thus, improving the recall of our model. We have conducted several experiments to evaluate our approach on a very large dataset with +4.973M methods in 92 different project versions. The results show that our tool can have a relative improvement up to 160% on F-score when comparing with the state-of-the-art bug detection approaches. Our tool can detect 48 true bugs in the list of top 100 reported bugs, which is 24 more true bugs when comparing with the baseline approaches. We also reported that our representation is better suitable for bug detection and relatively improves over the other representations up to 206% in accuracy. +	representation defect
2019	CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning + + + + +	Ziyu Yao, Jayavardhan Reddy Peddamail, Huan Sun		To accelerate software development, much research has been performed +to help people understand and reuse the huge amount of available code +resources. Two important tasks have been widely studied: code retrieval, +which aims to retrieve code snippets relevant to a given natural language +query from a code base, and code annotation, where the goal is to annotate a +code snippet with anatural language description. Despite their advancement in recent +years, the two tasks are mostly explored separately. In this work, we +investigate a novel perspective of Code annotation for Code retrieval +(hence called “CoaCor”), where a code annotation model is trained +to generate a natural language annotation that can represent the +semantic meaning of a given code snippet and can be leveraged by +a code retrieval model to better distinguish relevant code snippets +from others. To this end, we propose an effective framework based +on reinforcement learning, which explicitly encourages the code +annotation model to generate annotations that can be used for the +retrieval task. Through extensive experiments, we show that code +annotations generated by our framework are much more detailed +and more useful for code retrieval, and they can further improve +the performance of existing code retrieval models significantly. +	search
2019	SPoC: Search-based Pseudocode to Code + + + + +	Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, Percy S. Liang		We consider the task of mapping pseudocode to long programs that are functionally correct. Given test cases as a mechanism to validate programs, we search over the space of possible translations of the pseudocode to find a program that passes the validation. However, without proper credit assignment to localize the sources of program failures, it is difficult to guide search toward more promising programs. We propose to perform credit assignment based on signals from compilation errors, which constitute 88.7% of program failures. Concretely, we treat the translation of each pseudocode line as a discrete portion of the program, and whenever a synthesized program fails to compile, an error localization method tries to identify the portion of the program responsible for the failure. We then focus search over alternative translations of the pseudocode for those portions. For evaluation, we collected the SPoC dataset (Search-based Pseudocode to Code) containing 18,356 programs with human-authored pseudocode and test cases. Under a budget of 100 program compilations, performing search improves the synthesis success rate over using the top-one translation of the pseudocode from 25.6% to 44.7%. +	bimodal synthesis
2019	A Neural Approach to Decompiled Identifier Renaming + + + + +	Jeremy Lacomis, Pengcheng Yin, Edward J. Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, Bogdan Vasilescu	ASE	The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. However, compilation loses information contained within the original source code (e.g. structure, type information, and variable names). Semantically meaningful variable names are known to increase code understandability, but they generally cannot be recovered by decompilers. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3% of the time. +	deobfuscation naming compilation
2019	PathMiner : A Library for Mining of Path-Based Representations of Code + + + + +	Vladimir Kovalenko, Egor Bogomolov, Timofey Bryksin, Alberto Bacchelli.	MSR	One recent, significant advance in modeling source code for machine learning algorithms has been the introduction of path-based representation – an approach consisting in representing a snippet of code as a collection of paths from its syntax tree. Such representation efficiently captures the structure of code, which, in turn, carries its semantics and other information. +Building the path-based representation involves parsing the code and extracting the paths from its syntax tree; these steps build up to a substantial technical job. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from the essential work and hinders newcomers in the field of machine learning on code. + + In this paper, we present PathMiner – an open-source library for mining path-based representations of code. PathMiner is fast, flexible, well-tested, and easily extensible to support input code in any common programming language. Preprint [https://doi.org/10.5281/zenodo.2595271]; released tool [https://doi.org/10.5281/zenodo.2595257]. +	representation grammar
2019	Learning to Represent Edits + + + + +	Pengcheng Yin, Graham Neubig, Miltiadis Allamanis, Marc Brockschmidt, Alexander L. Gaunt	ICLR	We introduce the problem of learning distributed representations of edits. By combining a +“neural editor” with an “edit encoder”, our models learn to represent the salient +information of an edit and can be used to apply edits to new inputs. +We experiment on natural language and source code edit data. Our evaluation yields +promising results that suggest that our neural network models learn to capture +the structure and semantics of edits. We hope that this interesting task and +data source will inspire other researchers to work further on this problem. +	edit
2019	A Grammar-Based Structural CNN Decoder for Code Generation + + + + +	Zeyu Sun, Qihao Zhu, Lili Mou, Yingfei Xiong, Ge Li, Lu Zhang	AAAI	Code generation maps a program description to executable +source code in a programming language. Existing approaches +mainly rely on a recurrent neural network (RNN) as the decoder. However, we find that a program contains significantly +more tokens than a natural language sentence, and thus it may +be inappropriate for RNN to capture such a long sequence. In +this paper, we propose a grammar-based structural convolutional neural network (CNN) for code generation. Our model +generates a program by predicting the grammar rules of the +programming language; we design several CNN modules, including the tree-based convolution and pre-order convolution, +whose information is further aggregated by dedicated attentive pooling layers. Experimental results on the HearthStone +benchmark dataset show that our CNN code generator significantly outperforms the previous state-of-the-art method by 5 +percentage points; additional experiments on several semantic parsing tasks demonstrate the robustness of our model. We +also conduct in-depth ablation test to better understand each +component of our model. +	code generation grammar
2019	NEUZZ: Efficient Fuzzing with Neural Program Smoothing + + + + +	Dongdong She, Kexin Pei, Dave Epstein, Junfeng Yang, Baishakhi Ray, Suman Jana	IEEE S&P	Fuzzing has become the de facto standard technique for finding software vulnerabilities. However, even state-of-the-art fuzzers are not very efficient at finding hard-to-trigger software bugs. Most popular fuzzers use evolutionary guidance to generate inputs that can trigger different bugs. Such evolutionary algorithms, while fast and simple to implement, often get stuck in fruitless sequences of random mutations. Gradient-guided optimization presents a promising alternative to evolutionary guidance. Gradient-guided techniques have been shown to significantly outperform evolutionary algorithms at solving high-dimensional structured optimization problems in domains like machine learning by efficiently utilizing gradients or higher-order derivatives of the underlying function. However, gradient-guided approaches are not directly applicable to fuzzing as real-world program behaviors contain many discontinuities, plateaus, and ridges where the gradient-based methods often get stuck. We observe that this problem can be addressed by creating a smooth surrogate function approximating the discrete branching behavior of target program. In this paper, we propose a novel program smoothing technique using surrogate neural network models that can incrementally learn smooth approximations of a complex, real-world program’s branching behaviors. We further demonstrate that such neural network models can be used together with gradient-guided input generation schemes to significantly improve the fuzzing efficiency. Our extensive evaluations demonstrate that NEUZZ significantly outperforms 10 state-of-the-art graybox fuzzers on 10 real-world programs both at finding new bugs and achieving higher edge coverage. NEUZZ found 31 unknown bugs that other fuzzers failed to find in 10 real world programs and achieved 3X more edge coverage than all of the tested graybox fuzzers for 24 hours running. +	fuzzing
2019	Recommendations for Datasets for Source Code Summarization + + + + +	Alexander LeClair, Collin McMillan	NAACL 2019	Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable datasets. In addition, a lack of community standards for creating datasets leads to confusing and unreproducible research results – we observe swings in performance of more than 33% due only to changes in dataset design. In this paper, we make recommendations for these standards from experimental results. We release a dataset based on prior work of over 2.1m pairs of Java methods and one sentence method descriptions from over 28k Java projects. We describe the dataset and point out key differences from natural language data, to guide and support future researchers. +	summarization dataset
2018	Neural-Machine-Translation-Based Commit Message Generation: How Far Are We? + + + + +	Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, Xinyu Wang	ASE	Commit messages can be regarded as the documentation of software changes. These messages describe the content and purposes of changes, hence are useful for program comprehension and software maintenance. However, due to the lack of time and direct motivation, commit messages sometimes are neglected by developers. To address this problem, Jiang et al. proposed an approach (we refer to it as NMT), which leverages a neural machine translation algorithm to automatically generate short commit messages from code. The reported performance of their approach is promising, however, they did not explore why their approach performs well. Thus, in this paper, we first perform an in-depth analysis of their experimental results. We find that (1) Most of the test <pre>diffs</pre> from which NMT can generate high-quality messages are similar to one or more training <pre>diffs</pre> at the token level. (2) About 16% of the commit messages in Jiang et al.’s dataset are noisy due to being automatically generated or due to them describing repetitive trivial changes. (3) The performance of NMT declines by a large amount after removing such noisy commit messages. In addition, NMT is complicated and time-consuming. Inspired by our first finding, we proposed a simpler and faster approach, named NNGen (Nearest Neighbor Generator), to generate concise commit messages using the nearest neighbor algorithm. Our experimental results show that NNGen is over 2,600 times faster than NMT, and outperforms NMT in terms of BLEU (an accuracy measure that is widely used to evaluate machine translation systems) by 21%. Finally, we also discuss some observations for the road ahead for automated commit message generation to inspire other researchers. +	edit summarization
2018	Exploring the Naturalness of Buggy Code with Recurrent Neural Network + + + + +	Jack Lanchantin, Ji Gao		Statistical language models are powerful tools +which have been used for many tasks within natural language processing. Recently, they have been +used for other sequential data such as source code. +(Ray et al., 2015) showed that it is possible train an +n-gram +source code language mode, and use it to +predict buggy lines in code by determining “unnatural” lines via entropy with respect to the language +model. In this work, we propose using a more advanced language modeling technique, Long Short-term Memory recurrent neural networks, to model +source code and classify buggy lines based on entropy. We show that our method slightly outperforms an +n-gram model in the buggy line classification task using AUC +	language model defect
2018	CODIT: Code Editing with Tree-Based Neural Machine Translation + + + + +	Saikat Chakraborty, Miltiadis Allamanis, Baishakhi Ray		The way developers edit day-to-day code tends to be repetitive, often using existing code elements. Many researchers have tried to automate repetitive code changes by learning from specific change templates which are applied to limited scope. The advancement of Neural Machine Translation (NMT) and the availability of vast open-source evolutionary data opens up the possibility of automatically learning those templates from the wild. However, unlike natural languages, for which NMT techniques were originally devised, source code and its changes have certain properties. For instance, compared to natural language, source code vocabulary can be significantly larger. Further, good changes in code do not break its syntactic structure. Thus, deploying state-of-the-art NMT models without adapting the methods to the source code domain yields sub-optimal results. To this end, we propose a novel Tree based NMT system to model source code changes and learn code change patterns from the wild. We realize our model with a change suggestion engine: CODIT and train the model with more than 30k real-world changes and evaluate it on 6k patches. Our evaluation shows the effectiveness of CODIT in learning and suggesting patches.CODIT also shows promise generating bug fix patches. +	grammar grammar repair code generation
2018	Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks + + + + +	Nghi D. Q. Bui, Lingxiao Jiang, Yijun Yu	NLSE	Towards the vision of translating code that implements an algorithm from one programming language into another, this +paper proposes an approach for automated program classification using +bilateral tree-based convolutional neural networks +(BiTBCNNs). It is layered on top of two tree-based +convolutional neural networks (TBCNNs), each of which recognizes the algorithm of code written in an individual programming language. The combination layer of the networks +recognizes the similarities and differences among code in different programming languages. The BiTBCNNs are trained +using the source code in different languages but known to +implement the same algorithms and/or functionalities. For +a preliminary evaluation, we use 3591 Java and 3534 C++ +code snippets from 6 algorithms we crawled systematically +from GitHub. We obtained over 90% accuracy in the cross-language binary classification task to tell whether any given +two code snippets implement a same algorithm. Also, for the +algorithm classification task, i.e., to predict which one of the +six algorithm labels is implemented by an arbitrary C++ code +snippet, we achieved over 80% precision. +	representation grammar
2018	Bilateral Dependency Neural Networks for Cross-Language Algorithm Classification + + + + +	Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang	SANER	Algorithm classification is to automatically identify +the classes of a program based on the algorithm(s) and/or data +structure(s) implemented in the program. It can be useful for +various tasks, such as code reuse, code theft detection, and malware detection. Code similarity metrics, on the basis of features +extracted from syntax and semantics, have been used to classify +programs. Such features, however, often need manual selection +effort and are specific to individual programming languages, +limiting the classifiers to programs in the same language. +To recognize the similarities and differences among algorithms +implemented in different languages, this paper describes a +framework of Bilateral Neural Networks (Bi-NN) that builds a +neural network on top of two underlying sub-networks, each of +which encodes syntax and semantics of code in one language. A +whole Bi-NN can be trained with bilateral programs that implement the same algorithms and/or data structures in different +languages and then be applied to recognize algorithm classes +across languages. + + We have instantiated the framework with several kinds of +token-, tree- and graph-based neural networks that encode and +learn various kinds of information in code. We have applied +the instances of the framework to a code corpus collected from +GitHub containing thousands of Java and C++ programs imple- +menting 50 different algorithms and data structures. Our evalua- +tion results show that the use of Bi-NN indeed produces promising +algorithm classification results both within one language and +across languages, and the encoding of dependencies from code +into the underlying neural networks helps improve algorithm +classification accuracy further. In particular, our custom-built +dependency trees with tree-based convolutional neural networks +achieve the highest classification accuracy among the different +instances of the framework that we have evaluated. Our study +points to a possible future research direction to tailor bilateral +and multilateral neural networks that encode more relevant +semantics for code learning, mining and analysis tasks +	representation
2018	Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code + + + + +	Nghi D. Q. Bui, Lingxiao Jiang	ICSE	Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may be either specific to the language grammars, or specific to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach to automatically learn cross-language representations for various kinds of structural code elements that may be used for program translation. Our key idea is two folded: First, we normalize and enrich code token streams with additional structural and semantic information, and train cross-language vector representations for the tokens (a.k.a. shared embeddings based on word2vec, a neural-network-based technique for producing word embeddings; Second, hierarchically from bottom up, we construct shared embeddings for code elements of higher levels of granularity (e.g., expressions, statements, methods) from the embeddings for their constituents, and then build mappings among code elements across languages based on similarities among embeddings. +Our preliminary evaluations on about 40,000 Java and C# source files from 9 software projects show that our approach can automatically learn shared embeddings for various code elements in different languages and identify their cross-language mappings with reasonable Mean Average Precision scores. When compared with an existing tool for mapping library API methods, our approach identifies many more mappings accurately. The mapping results and code can be accessed at this https URL. We believe that our idea for learning cross-language vector representations with code structural information can be a useful step towards automated program translation. +	representation
2018	Deep Learning Type Inference + + + + +	V. J. Hellendoorn, Christian Bird, Earl T. Barr, Miltiadis Allamanis	FSE	Dynamically typed languages such as JavaScript and Python are +increasingly popular, yet static typing has not been totally eclipsed: +Python now supports type annotations and languages like TypeScript offer a middle-ground for JavaScript: a strict superset of +JavaScript, to which it transpiles, coupled with a type system that +permits partially typed programs. However, static typing has a cost: +adding annotations, reading the added syntax, and wrestling with +the type system to fix type errors. Type inference can ease the +transition to more statically typed code and unlock the benefits of +richer compile-time information, but is limited in languages like +JavaScript as it cannot soundly handle duck-typing or runtime evaluation +via eval. We propose DeepTyper, a deep learning model +that understands which types naturally occur in certain contexts +and relations and can provide type suggestions, which can often +be verified by the type checker, even if it could not infer the type +initially. DeepTyper, leverages an automatically aligned corpus +of tokens and types to accurately predict thousands of variable +and function type annotations. Furthermore, we demonstrate that +context is key in accurately assigning these types and introduce a +technique to reduce overfitting on local cues while highlighting the +need for further improvements. Finally, we show that our model +can interact with a compiler to provide more than 4,000 additional +type annotations with over 95% precision that could not be inferred +without the aid of DeepTyper. +	representation types
2018	Mapping Language to Code in Programmatic Context + + + + +	Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer	EMNLP	Source code is rarely written in isolation. It depends significantly on the programmatic context, such as the class that the code would reside in. To study this phenomenon, we introduce the task of generating class member functions given English documentation and the programmatic context provided by the rest of the class. This task is challenging because the desired code can vary greatly depending on the functionality the class provides (e.g., a sort function may or may not be available when we are asked to “return the smallest element” in a particular member variable list). We introduce CONCODE, a new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develop a new encoder-decoder architecture that models the interaction between the method documentation and the class environment. We also present a detailed error analysis suggesting that there is significant room for future work on this task. +	bimodal code generation
2018	Compiler Fuzzing through Deep Learning + + + + +	Chris Cummins, Pavlos Petoumenos, Alastair Murray, Hugh Leather	ISSTA	Random program generation — fuzzing — is an effective technique +for discovering bugs in compilers but successful fuzzers require +extensive development effort for every language supported by the +compiler, and often leave parts of the language space untested. + + We introduce DeepSmith, a novel machine learning approach +to accelerating compiler validation through the inference of generative models for compiler inputs. Our approach +infers a learned +model of the structure of real world code based on a large corpus of open source code. Then, it uses the model to automatically +generate tens of thousands of realistic programs. Finally, we apply +established differential testing methodologies on them to expose +bugs in compilers. We apply our approach to the OpenCL programming language, automatically exposing bugs with little effort on our +side. In 1,000 hours of automated testing of commercial and open +source compilers, we discover bugs in all of them, submitting 67 +bug reports. Our test cases are on average two orders of magnitude +smaller than the state-of-the-art, require 3.03× less time to generate +and evaluate, and expose bugs which the state-of-the-art cannot. +Our random program generator, comprising only 500 lines of code, +took 12 hours to train for OpenCL versus the state-of-the-art taking +9 man months to port from a generator for C and 50,000 lines of +code. With 18 lines of code we extended our program generator to +a second language, uncovering crashes in Solidity compilers in 12 +hours of automated testing. +	fuzzing code generation
2018	Deep Learning to Detect Redundant Method Comments + + + + +	Annie Louis, Santanu Kumar Dash, Earl T. Barr, Charles Sutton		Comments in software are critical for maintenance and reuse. But apart from prescriptive advice, there is little practical support or quantitative understanding of what makes a comment useful. In this paper, we introduce the task of identifying comments which are uninformative about the code they are meant to document. To address this problem, we introduce the notion of comment entailment from code, high entailment indicating that a comment’s natural language semantics can be inferred directly from the code. Although not all entailed comments are low quality, comments that are too easily inferred, for example, comments that restate the code, are widely discouraged by authorities on software style. Based on this, we develop a tool called CRAIC which scores method-level comments for redundancy. Highly redundant comments can then be expanded or alternately removed by the developer. CRAIC uses deep language models to exploit large software corpora without requiring expensive manual annotations of entailment. We show that CRAIC can perform the comment entailment task with good agreement with human judgements. Our findings also have implications for documentation tools. For example, we find that common tags in Javadoc are at least two times more predictable from code than non-Javadoc sentences, suggesting that Javadoc tags are less informative than more free-form comments +	bimodal documentation
2018	Content Aware Source Code Change Description Generation + + + + +	Pablo Loyola, Edison Marrese-Taylor, Jorge Balazs, Yutaka Matsuo, Fumiko Satoh	International Natural Language Generation Conference	We propose to study the generation of descriptions from source code changes by integrating the messages included on code +commits and the intra-code documentation +inside the source in the form of docstrings. +Our hypothesis is that although both types +of descriptions are not directly aligned in +semantic terms —one explaining a change +and the other the actual functionality of +the code being modified— there could be +certain common ground that is useful for +the generation. To this end, we propose +an architecture that uses the source code-docstring relationship to guide the description generation. We discuss the results of +the approach comparing against a baseline +based on a sequence-to-sequence model, +using standard automatic natural language +generation metrics as well as with a human +study, thus offering a comprehensive view +of the feasibility of the approach. +	edit summarization
2018	User-guided program reasoning using Bayesian inference + + + + +	Mukund Raghothaman, Sulekha Kulkarni, Kihong Heo, Mayur Naik	PLDI	Program analyses necessarily make approximations that often lead them to report true alarms interspersed with many false alarms. We propose a new approach to leverage user feedback to guide program analyses towards true alarms and away from false alarms. Our approach associates each alarm with a confidence value by performing Bayesian inference on a probabilistic model derived from the analysis rules. In each iteration, the user inspects the alarm with the highest confidence and labels its ground truth, and the approach recomputes the confidences of the remaining alarms given this feedback. It thereby maximizes the return on the effort by the user in inspecting each alarm. We have implemented our approach in a tool named Bingo for program analyses expressed in Datalog. Experiments with real users and two sophisticated analyses—a static datarace analysis for Java programs and a static taint analysis for Android apps—show significant improvements on a range of metrics, including false alarm rates and number of bugs found. +	program analysis
2018	Neuro-symbolic program corrector for introductory programming assignments + + + + +	Sahil Bhatia, Pushmeet Kohli, Rishabh Singh	ICSE	Automatic correction of programs is a challenging problem with numerous real world applications in security, verification, and education. One application that is becoming increasingly important is the correction of student submissions in online courses for providing feedback. Most existing program repair techniques analyze Abstract Syntax Trees (ASTs) of programs, which are unfortunately unavailable for programs with syntax errors. In this paper, we propose a novel Neuro-symbolic approach that combines neural networks with constraint-based reasoning. Specifically, our method first uses a Recurrent Neural Network (RNN) to perform syntax repairs for the buggy programs; subsequently, the resulting syntactically-fixed programs are repaired using constraint-based techniques to ensure functional correctness. The RNNs are trained using a corpus of syntactically correct submissions for a given programming assignment, and are then queried to fix syntax errors in an incorrect programming submission by replacing or inserting the predicted tokens at the error location. We evaluate our technique on a dataset comprising of over 14,500 student submissions with syntax errors. Our method is able to repair syntax errors in 60% (8689) of submissions, and finds functionally correct repairs for 23.8% (3455) submissions. +	repair
2018	Neural Code Comprehension: A Learnable Representation of Code Semantics + + + + +	Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler	NeurIPS	With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that with a single RNN architecture and pre-trained fixed embeddings, inst2vec outperforms specialized approaches for performance prediction (compute device mapping, optimal thread coarsening); and algorithm classification from raw code (104 classes), where we set a new state-of-the-art. +	representation
2018	Neural-Augumented Static Analysis of Android Communication + + + + +	Jinman Zhao, Aws Albarghouthi, Vaibhav Rastogi, Somesh Jha, Damien Octeau	FSE	We address the problem of discovering communication links between applications in the popular Android mobile operating system, an important problem for security and privacy in Android. Any scalable static analysis in this complex setting is bound to produce an excessive amount of false-positives, rendering it impractical. To improve precision, we propose to augment static analysis with a trained neural-network model that estimates the probability that a communication link truly exists. We describe a neural-network architecture that encodes abstractions of communicating objects in two applications and estimates the probability with which a link indeed exists. At the heart of our architecture are type-directed encoders (TDE), a general framework for elegantly constructing encoders of a compound data type by recursively composing encoders for its constituent types. We evaluate our approach on a large corpus of Android applications, and demonstrate that it achieves very high accuracy. Further, we conduct thorough interpretability studies to understand the internals of the learned neural networks. +	program analysis
2018	Evaluation of Type Inference with Textual Cues + + + + +	Amirreza A. Shirani, A. Pastor Lopez-Monroy, Fabio Gonzalez, Thamar Solorio, Mohammad Amin Alipour	NLSE	Type information plays an important role in the success of information retrieval and recommendation systems in software +engineering. Thus, the absence of types in dynamically-typed +languages poses a challenge to adapt these systems to support +dynamic languages. + + In this paper, we explore the viability of type inference using +textual cues. That is, we formulate the type inference problem as a classification problem which uses the textual features +in the source code to predict the type of variables. In this +approach, a classifier learns a model to distinguish between +types of variables in a program. The model is subsequently +used to (approximately) infer the types of other variables. + + We evaluate the feasibility of this approach on four Java +projects wherein type information is already available in the +source code and can be used to train and test a classifier. Our +experiments show this approach can predict the type of new +variables with relatively high accuracy (80% F-measure). +These results suggest that textual cues can be +complementary +tools in inferring types for dynamic languages. +	information extraction
2018	Improving Automatic Source Code Summarization via Deep Reinforcement Learning + + + + +	Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, Philip S. Yu	ASE	Code summarization provides a high level natural language description of the function performed by code, as it can benefit the software maintenance, code categorization and retrieval. To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework which encodes the code into a hidden space and then decode it into natural language space, suffering from two major drawbacks: a) Their encoders only consider the sequential content of code, ignoring the tree structure which is also critical for the task of code summarization; b) Their decoders are typically trained to predict the next word by maximizing the likelihood of next ground-truth word with previous ground-truth word given. However, it is expected to generate the entire sequence from scratch at test time. This discrepancy can cause an exposure bias issue, making the learnt decoder suboptimal. In this paper, we incorporate an abstract syntax tree structure as well as sequential content of code snippets into a deep reinforcement learning framework (i.e., actor-critic network). The actor network provides the confidence of predicting the next word according to current state. On the other hand, the critic network evaluates the reward value of all possible extensions of the current state and can provide global guidance for explorations. We employ an advantage reward composed of BLEU metric to train both networks. Comprehensive experiments on a real-world dataset show the effectiveness of our proposed model when compared with some state-of-the-art methods. +	summarization documentation
2018	Generating Regular Expressions from Natural Language Specifications: Are We There Yet? + + + + +	Zexuan Zhong, Jiaqi Guo, Wei Yang, Tao Xie, Jian-Guang Lou, Ting Liu, Dongmei Zhang	NLSE	Recent state-of-the-art approaches automatically generate +regular expressions from natural language specifications. +Given that these approaches use only synthetic data in both +training datasets and validation/test datasets, a natural question arises: are these approaches effective to address various +real-world situations? To explore this question, in this paper, we conduct a characteristic study on comparing two synthetic datasets used by the recent research and a real-world +dataset collected from the Internet, and conduct an experimental study on applying a state-of-the-art approach on the +real-world dataset. Our study results suggest the existence of +distinct characteristics between the synthetic datasets and the +real-world dataset, and the state-of-the-art approach (based +on a model trained from a synthetic dataset) achieves extremely low effectiveness when evaluated on real-world data, +much lower than the effectiveness when evaluated on the synthetic dataset. We also provide initial analysis on some of +those challenging cases and discuss future directions. +	bimodal code generation
2018	A General Path-Based Representation for Predicting Program Properties + + + + +	Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav	PLDI	Predicting program properties such as names or expression types has a wide range of applications. It can ease the task of programming and increase programmer productivity. A major challenge when learning from programs is how to represent programs in a way that facilitates effective learning. +We present a general path-based representation for learning from programs. Our representation is purely syntactic and extracted automatically. The main idea is to represent a program using paths in its abstract syntax tree (AST). This allows a learning model to leverage the structured nature of code rather than treating it as a flat sequence of tokens. +We show that this representation is general and can: (i) cover different prediction tasks, (ii) drive different learning algorithms (for both generative and discriminative models), and (iii) work across different programming languages. +We evaluate our approach on the tasks of predicting variable names, method names, and full types. We use our representation to drive both CRF-based and word2vec-based learning, for programs of four languages: JavaScript, Java, Python and C#. Our evaluation shows that our approach obtains better results than task-specific handcrafted representations across different tasks and programming languages. +	naming representation
2018	Automated Vulnerability Detection in Source Code Using Deep Representation Learning + + + + +	Rebecca L. Russell, Louis Kim, Lei H. Hamilton, Tomo Lazovich, Jacob A. Harer, Onur Ozdemir, Paul M. Ellingwood, Marc W. McConley		Increasing numbers of software vulnerabilities are discovered every year whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose serious risk of exploit and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully-selected findings from three different static analyzers that indicate potential exploits. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection. +	program analysis
2018	Polyglot Semantic Parsing in APIs + + + + +	Kyle Richardson, Jonathan Berant, Jonas Kuhn	NAACL	Traditional approaches to semantic parsing (SP) work by training individual models for each available parallel dataset of text-meaning pairs. In this paper, we explore the idea of polyglot semantic translation, or learning semantic parsing models that are trained on multiple datasets and natural languages. In particular, we focus on translating text to code signature representations using the software component datasets of Richardson and Kuhn (2017a,b). The advantage of such models is that they can be used for parsing a wide variety of input natural languages and output programming languages, or mixed input languages, using a single unified model. To facilitate modeling of this type, we develop a novel graph-based decoding framework that achieves state-of-the-art performance on the above datasets, and apply this method to two other benchmark SP tasks. +	bimodal API
2018	Learning to Represent Programs with Graphs + + + + +	Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi	ICLR	Learning tasks on source code (i.e., formal languages) have been considered recently, but most work has tried to transfer natural language methods and does not capitalize on the unique opportunities offered by code’s known syntax. For example, long-range dependencies induced by using the same variable or function in distant locations are often not considered. We propose to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures. + + In this work, we present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to such large graphs. We evaluate our method on two tasks: VarNaming, in which a network attempts to predict the name of a variable given its usage, and VarMisuse, in which the network learns to reason about selecting the correct variable that should be used at a given program location. Our comparison to methods that use less structured program representations shows the advantages of modeling known structure, and suggests that our models learn to infer meaningful names and to solve the VarMisuse task in many cases. Additionally, our testing showed that VarMisuse identifies a number of bugs in mature open-source projects. +	naming GNN representation variable misuse defect
2018	StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow + + + + +	Ziyu Yao, Daniel S. Weld, Wei-Peng Chen, Huan Sun	WWW 2018	Stack Overflow (SO) has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including code retrieval and annotation. In most existing research, question-code pairs were collected heuristically and tend to have low quality. In this paper, we investigate a new problem of systematically mining question-code pairs from Stack Overflow (in contrast to heuristically collecting them). It is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-View Hierarchical Neural Network which can capture both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets in Python and SQL domain, our framework substantially outperforms heuristic methods with at least 15% higher F1 and accuracy. Furthermore, we present StaQC (Stack Overflow Question-Code pairs), the largest dataset to date of ∼148K Python and ∼120K SQL question-code pairs, automatically mined from SO using our framework. Under various case studies, we demonstrate that StaQC can greatly help develop data-hungry models for associating natural language with programming language. +	dataset
2018	Learning How to Mutate Source Code from Bug-Fixes + + + + +	Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk		Mutation testing has been widely accepted as an approach to guide test case generation or to assess the effectiveness of test suites. Empirical studies have shown that mutants are representative of real faults; yet they also indicated a clear need for better, possibly customized, mutation operators and strategies. While some recent papers have tried to devise domain-specific or general purpose mutator operators by manually analyzing real faults, such an activity is effort- (and error-) prone and does not deal with an important practical question as to how to really mutate a given source code element. We propose a novel approach to automatically learn mutants from faults in real programs. First, our approach processes bug fixing changes using fine-grained differencing, code abstraction, and change clustering. Then, it learns mutation models using a deep learning strategy. We have trained and evaluated our technique on a set of ~787k bugs mined from GitHub. Starting from code fixed by developers in the context of a bug-fix, our empirical evaluation showed that our models are able to predict mutants that resemble original fixed bugs in between 9% and 45% of the cases (depending on the model). Moreover, over 98% of the automatically generated mutants are lexically and syntactically correct. +	repair edit
2018	NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System + + + + +	Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, Michael D. Ernst	LREC	We present new data and semantic parsing methods for the problem of mapping english sentences to Bash commands (NL2Bash). Our long-term goal is to enable any user to easily solve otherwise repetitive tasks (such as file manipulation, search, and application-specific scripting) by simply stating their intents in English. We take a first step in this domain, by providing a large new dataset of challenging but commonly used commands paired with their English descriptions, along with the baseline methods to establish performance levels on this task. +	bimodal code generation
2018	Oreo: detection of clones in the twilight zone + + + + +	Vaibhav Saini, Farima Farmahinifarahani, Yadong Lu, Pierre Baldi, Cristina Lopes	ESEC/FSE	Source code clones are categorized into four types of increasing difficulty of detection, ranging from purely textual (Type-1) to purely semantic (Type-4). Most clone detectors reported in the literature work well up to Type-3, which accounts for syntactic differences. In between Type-3 and Type-4, however, there lies a spectrum of clones that, although still exhibiting some syntactic similarities, are extremely hard to detect – the Twilight Zone. Most clone detectors reported in the literature fail to operate in this zone. We present Oreo, a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately, but is also capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. We evaluate the recall of Oreo on BigCloneBench, and perform manual evaluation for precision. Oreo has both high recall and precision. More importantly, it pushes the boundary in detection of clones with moderate to weak syntactic similarity in a scalable manner. +	clone
2018	Bayesian Sketch Learning for Program Synthesis + + + + +	Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, Chris Jermaine	ICLR	We present a Bayesian statistical approach to the problem of automatic program synthesis. Our synthesizer starts +by learning, offline and from an existing corpus, a probabilistic model of real-world programs. During synthesis, +it is provided some ambiguous and incomplete evidence about the nature of the programming task that the user +wants automated, for example sets of API calls or data types that are relevant for the task. Given this input, the +synthesizer infers a posterior distribution over type-safe programs that assigns higher likelihood to programs +that, according to the learned model, are more likely to match the evidence. + + We realize this approach using two key ideas. First, our learning techniques operate not over code but +syntactic abstractions, or sketches, of programs. During synthesis, we infer a posterior distribution over sketches, +then concretize samples from this distribution into type-safe programs using combinatorial techniques. Second, +our statistical model explicitly models the full intent behind a synthesis task as a latent variable. To infer +sketches, we first estimate a posterior distribution on the intent, then use samples from this posterior to generate +a distribution over possible sketches. We show that our model can be implemented effectively using the new +neural architecture of Bayesian encoder-decoders, which can be trained with stochastic gradient descent and +yields a simple inference procedure. + + We implement our ideas in a system, called BAYOU , for the synthesis of API-heavy Java methods. We train +BAYOU on a large corpus of Android apps, and find that the trained system can often synthesize complex +methods given just a few API method names or data types as evidence. The experiments also justify the design +choice of using a latent intent variable and the levels of abstraction at which sketches and evidence are defined. +	code generation API
2018	Learning Loop Invariants for Program Verification + + + + +	Xujie Si, Hanjun Dai, Mukund Raghothaman, Mayur Naik, Le Song	NeurIPS	A fundamental problem in program verification concerns inferring loop invariants. +The problem is undecidable and even practical instances are challenging. Inspired +by how human experts construct loop invariants, we propose a reasoning framework +CODE2INV +that constructs the solution by multi-step decision making and querying +an external program graph memory block. By training with reinforcement learning, +CODE2INV +captures rich program features and avoids the need for ground truth +solutions as supervision. Compared to previous learning tasks in domains with +graph-structured data, it addresses unique challenges, such as a binary objective +function and an extremely sparse reward that is given by an automated theorem +prover only after the complete loop invariant is proposed. We evaluate +CODE2INV on +a suite of 133 benchmark problems and compare it to three state-of-the-art systems. +It solves 106 problems compared to 73 by a stochastic search-based system, 77 by +a heuristic search-based system, and 100 by a decision tree learning-based system. +Moreover, the strategy learned can be generalized to new programs: compared to +solving new instances from scratch, the pre-trained agent is more sample efficient +in finding solutions. +	program analysis verification
2018	Building Language Models for Text with Named Entities + + + + +	M.R. Parvez, Saikat Chakraborty, Baishakhi Ray, KW Chang	ACL	Text in many domains involves a significant amount of named entities. Predicting the entity names is often challenging +for a language model as they appear less +frequent on the training corpus. In this +paper, we propose a novel and effective +approach to building a discriminative language model which can learn the entity +names by leveraging their entity type information. We also introduce two benchmark datasets based on recipes and Java +programming codes, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2% +better perplexity in recipe generation and +22.06% on code generation than the state-of-the-art language models. +	language model
2018	A Deep Learning Approach to Identifying Source Code in Images and Video + + + + +	Jordan Ott, Abigail Atchison, Paul Harnack, Adrienne Bergh, Erik Linstead.	MSR	While substantial progress has been made in mining code on an +Internet scale, efforts to date have been overwhelmingly focused on +data sets where source code is represented natively as text. Large +volumes of source code available online and embedded in technical +videos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing +approaches to code extraction and indexing in this environment rely +heavily on computationally intense optical character recognition. +To improve the ease and efficiency of identifying this embedded +code, as well as identifying similar code examples, we develop a +deep learning solution based on convolutional neural networks and +autoencoders. Focusing on Java for proof of concept, our technique +is able to identify the presence of typeset and handwritten source +code in thousands of video images with 85.6%-98.6% accuracy based +on syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides +a more scalable basis for video indexing that can be incorporated +into existing software search and mining tools. +	information extraction
2018	Learning to Repair Software Vulnerabilities with Generative Adversarial Networks + + + + +	Jacob Harer, Onur Ozdemir, Tomo Lazovich, Christopher P. Reale, Rebecca L. Russell, Louis Y. Kim, Peter Chin	NeurIPS	Motivated by the problem of automated repair of software vulnerabilities, we propose an adversarial learning approach that maps from one discrete source domain to another target domain without requiring paired labeled examples or source and target domains to be bijections. We demonstrate that the proposed adversarial learning approach is an effective technique for repairing software vulnerabilities, performing close to seq2seq approaches that require labeled pairs. The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation. +	repair code generation
2018	Syntax and Sensibility: Using language models to detect and correct syntax errors + + + + +	Eddie Antonio Santos, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, José Nelson Amaral	SANER	Syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of experience that help them quickly resolve these frustrating errors. Standard LR parsers are of little help, typically resolving syntax errors and their precise location poorly. We propose a methodology that locates where syntax errors occur, and suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by using language models trained on correct source code to find tokens that seem out of place. Fixes are synthesized by consulting the language models to determine what tokens are more likely at the estimated error location. We compare n-gram and LSTM (long short-term memory) language models for this task, each trained on a large corpus of Java code collected from GitHub. Unlike prior work, our methodology does not rely that the problem source code comes from the same domain as the training data. We evaluated against a repository of real student mistakes. Our tools are able to find a syntactically-valid fix within its top-2 suggestions, often producing the exact fix that the student used to resolve the error. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated with incomprehensible syntax errors. +	repair language model
2018	Intelligent code reviews using deep learning + + + + +	Anshul Gupta, Neel Sundaresan	KDD	Peer code review is a best practice in Software Engineering where source code is reviewed manually by one or more peers(reviewers) of the code author. It is widely acceptable both in industry and open-source software (OSS) systems as a process for early detection and reduction of software defects. A larger chunk of reviews given during peer reviews are related to common issues such as coding style, documentations, and best practices. This makes the code review process less effective as reviewers focus less on finding important defects. Hence, there is a need to automatically find such common issues and help reviewers perform focused code reviews. Some of this is solved by rule based systems called linters but they are rigid and needs a lot of manual effort to adapt them for a new issue. + + In this work, we present an automatic, flexible, and adaptive code analysis system called DeepCodeReviewer (DCR). DCR learns how to recommend code reviews related to common issues using historical peer reviews and deep learning. DCR uses deep learning to learn review relevance to a code snippet and recommend the right review from a repository of common reviews. DCR is trained on histroical peer reviews available from internal code repositories at Microsoft. Experiments demonstrate strong performance of developed deep learning model in classifying relevant and non-relevant reviews w.r.t to a code snippet, and ranking reviews given a code snippet. We have also evaluated DCR recommentations using a user study and survey. The results of our user study show good acceptance rate and answers of our survey questions are strongly correlated with our system’s goal of making code reviews focused on finding defects. +	representation review
2018	An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation + + + + +	Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk		Millions of open-source projects with numerous bug fixes are available in code repositories. This proliferation of software development histories can be leveraged to learn how to fix common programming bugs. To explore such a potential, we perform an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. First, we mine millions of bug-fixes from the change histories of projects hosted on GitHub, in order to extract meaningful examples of such bug-fixes. Next, we abstract the buggy and corresponding fixed code, and use them to train an Encoder-Decoder model able to translate buggy code into its fixed version. In our empirical investigation we found that such a model is able to fix thousands of unique buggy methods in the wild. Overall, this model is capable of predicting fixed patches generated by developers in 9-50% of the cases, depending on the number of candidate patches we allow it to generate. Also, the model is able to emulate a variety of different Abstract Syntax Tree operations and generate candidate patches in a split second. +	repair
2018	Path-Based Function Embedding and its Application to Specification Mining + + + + +	Daniel DeFreez, Aditya V. Thakur, Cindy Rubio-González	ICSE	Identifying the relationships among program elements is useful +for program understanding, debugging, and analysis. One such +relationship is synonymy. Function synonyms are functions that +play a similar role in code, e.g. functions that perform initialization +for different device drivers, or functions that implement different +symmetric-key encryption schemes. Function synonyms are not +necessarily semantically equivalent and can be syntactically dissimilar; consequently, approaches for identifying code clones or +functional equivalence cannot be used to identify them. This paper presents `func2vec`, an algorithm that maps each function to a vector in a vector space such that function synonyms are grouped +together. We compute the function embedding by training a neu- +ral network on sentences generated from random walks over an +encoding of the program as a labeled pushdown system (ℓ-PDS). +We demonstrate that `func2vec` +is effective at identifying function +synonyms in the Linux kernel. Furthermore, we show how function +synonyms enable mining error-handling specifications with high +support in Linux file systems and drivers. +	program analysis representation
2018	RefiNym: Using Names to Refine Types + + + + +	Santanu Dash, Miltiadis Allamanis, Earl T. Barr	FSE	Source code is bimodal: it combines a formal algorithmic channel and a natural language channel of identifiers and comments. In this work, we model the bimodality of code with name lows, an assignment low graph augmented to track identiier names. Conceptual types are logically distinct types that do not always coincide with program types. Passwords and URLs are example conceptual types that can share the program type string. Our tool, RefiNym, is an unsupervised method that mines a lattice of conceptual types from name lows and reiies them into distinct nominal types. For string, RefiNym inds and splits conceptual types originally merged into a single type, reducing the number of same-type variables per scope from 8.7 to 2.2 while eliminating 21.9% of scopes that have more than one same-type variable in scope. This makes the code more self-documenting and frees the type system to prevent a developer from inadvertently assigning data across conceptual types. +	program analysis types
2018	Open Vocabulary Learning on Source Code with a Graph-Structured Cache + + + + +	Milan Cvitkovic, Badal Singh, Anima Anandkumar		Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models’ performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time. +	GNN variable misuse defect representation
2018	Deep Learning Similarities from Different Representations of Source Code + + + + +	Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk	MSR	Assessing the similarity between code components plays a pivotal +role in a number of Software Engineering (SE) tasks, such as clone +detection, impact analysis, refactoring, etc. +Code similarity is generally measured by relying on manually defined or hand-crafted +features, e.g., by analyzing the overlap among identifiers or comparing the Abstract Syntax Trees of two code components. These +features represent a best guess at what SE researchers can utilize to +exploit and reliably assess code similarity for a given task. Recent +work has shown, when using a stream of identifiers to represent +the code, that Deep Learning (DL) can effectively replace manual +feature engineering for the task of clone detection. However, source +code can be represented at different levels of abstraction: identifiers, Abstract Syntax Trees, Control Flow Graphs, and Bytecode. +We conjecture that each code representation can provide a different, +yet orthogonal view of the same code fragment, thus, enabling a +more reliable detection of similarities in code. In this paper, we +demonstrate how SE tasks can benefit from a DL-based approach, +which can automatically learn code similarities from different representations. +	representation clone
2018	Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow + + + + +	Pengcheng Yin, B. Deng, E. Chen, B. Vasilescu, Graham Neubig	MSR	For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models require parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data. + +	dataset
2018	Learning to Generate Corrective Patches using Neural Machine Translation + + + + +	Hideaki Hata, Emad Shihab, Graham Neubig		Bug fixing is generally a manually-intensive task. However, recent work has proposed the idea of automated program repair, which aims to repair (at least a subset of) bugs in different ways such as code mutation, etc. Following in the same line of work as automated bug repair, in this paper we aim to leverage past fixes to propose fixes of current/future bugs. Specifically, we propose Ratchet, a corrective patch generation system using neural machine translation. By learning corresponding pre-correction and post-correction code in past fixes with a neural sequence-to-sequence model, Ratchet is able to generate a fix code for a given bug-prone code query. We perform an empirical study with five open source projects, namely Ambari, Camel, Hadoop, Jetty and Wicket, to evaluate the effectiveness of Ratchet. Our findings show that Ratchet can generate syntactically valid statements 98.7% of the time, and achieve an F1-measure between 0.41-0.83 with respect to the actual fixes adopted in the code base. In addition, we perform a qualitative validation using 20 participants to see whether the generated statements can be helpful in correcting bugs. Our survey showed that Ratchet’s output was considered to be helpful in fixing the bugs on many occasions, even if fix was not 100% correct. +	repair code generation
2018	Public Git Archive: a Big Code dataset for all + + + + +	Vadim Markovtsev, Waren Long	MSR	The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research potential enabled by the available open source code is clearly substantial, no significant large-scale open source code datasets exist. In this paper, we present the Public Git Archive – dataset of 182,014 top-bookmarked Git repositories from GitHub. We describe the novel data retrieval pipeline to reproduce it. We also elaborate on the strategy for performing dataset updates and legal issues. The Public Git Archive occupies 3.0 TB on disk and is an order of magnitude larger than the current source code datasets. The dataset is made available through HTTP and provides the source code of the projects, the related metadata, and development history. The data retrieval pipeline employs an optimized worker queue model and an optimized archive format to efficiently store forked Git repositories, reducing the amount of data to download and persist. Public Git Archive aims to open a myriad of new opportunities for Big Code research. +	dataset
2018	A Retrieve-and-Edit Framework for Predicting Structured Outputs + + + + +	Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, Percy S. Liang	NeurIPS	For the task of generating complex outputs such as source code, editing existing +outputs can be easier than generating complex outputs from scratch. With this +motivation, we propose an approach that first retrieves a training example based on +the input (e.g., natural language description) and then edits it to the desired output +(e.g., code). Our contribution is a computationally efficient method for learning +a retrieval model that embeds the input in a task-dependent way without relying +on a hand-crafted metric or incurring the expense of jointly training the retriever +with the editor. Our retrieve-and-edit framework can be applied on top of any +base model. We show that on a new autocomplete task for GitHub Python code +and the Hearthstone cards benchmark, retrieve-and-edit significantly boosts the +performance of a vanilla sequence-to-sequence model on both tasks. +	bimodal search code generation
2018	Deep Reinforcement Learning for Programming Language Correction + + + + +	Rahul Gupta, Aditya Kanade, Shirish Shevade		Novice programmers often struggle with the formal +syntax of programming languages. To assist them, +we design a novel programming language correction framework amenable to reinforcement learning. The framework allows an agent to mimic human actions for text navigation and editing. We +demonstrate that the agent can be trained through +self-exploration directly from the raw input, that is, +program text itself, without any knowledge of the +formal syntax of the programming language. We +leverage expert demonstrations for one tenth of the +training data to accelerate training. The proposed +technique is evaluated on 6975 +erroneous C programs with typographic errors, written by students +during an introductory programming course. Our +technique fixes 14% +more programs and 29% more +compiler error messages relative to those fixed by +a state-of-the-art tool, DeepFix, which uses a fully +supervised neural machine translation approach. +	repair code generation
2018	Deep Code Search + + + + +	Xiaodong Gu, Hongyu Zhang, Sunghun Kim.	ICSE	To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code. + + In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled. + + As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques. + +	search
2017	Finding Likely Errors with Bayesian Specifications + + + + +	Vijayaraghavan Murali, Swarat Chaudhuri, Chris Jermaine		We present a Bayesian framework for learning probabilistic specifications from large, unstructured code corpora, and +a method to use this framework to statically detect anomalous, hence likely buggy, program behavior. The distinctive +insight here is to build a statistical model that correlates all +specifications hidden inside a corpus with the syntax and +observed behavior of programs that implement these specifications. During the analysis of a particular program, this +model is conditioned into a posterior distribution that prioritizes specifications that are relevant to this program. This +allows accurate program analysis even if the corpus is highly +heterogeneous. The problem of finding anomalies is now +framed quantitatively, as a problem of computing a distance +between a “reference distribution” over program behaviors +that our model expects from the program, and the distribution over behaviors that the program actually produces. + + We present a concrete embodiment of our framework that +combines a topic model and a neural network model to learn +specifications, and queries the learned models to compute +anomaly scores. We evaluate this implementation on the +task of detecting anomalous usage of Android APIs. Our +encouraging experimental results show that the method can +automatically discover subtle errors in Android applications +in the wild, and has high precision and recall compared to +competing probabilistic approaches. +	program analysis API
2017	Exploring API Embedding for API Usages and Applications + + + + +	Trong Duc Nguyen, Anh Tuan Nguyen, Hung Dang Phan, Tien N. Nguyen	ICSE	Word2Vec is a class of neural network models that +as being trained from a large corpus of texts, they can produce for +each unique word a corresponding vector in a continuous space in +which linguistic contexts of words can be observed. In this work, +we study the characteristics of Word2Vec vectors, called API 2 VEC +or API embeddings, for the API elements within the API sequences in source code. Our empirical study shows that the close +proximity of the API 2 VEC vectors for API elements reflects the +similar usage contexts containing the surrounding APIs of those +API elements. Moreover, API 2 VEC can capture several similar +semantic relations between API elements in API usages via vector +offsets. We demonstrate the usefulness of API 2 VEC vectors for +API elements in three applications. First, we build a tool that mines the pairs of API elements that share the same usage relations +among them. The other applications are in the code migration +domain. We develop API 2 API , a tool to automatically learn the +API mappings between Java and C# using a characteristic of the +API 2 VEC vectors for API elements in the two languages: semantic +relations among API elements in their usages are observed in the +two vector spaces for the two languages as similar geometric +arrangements among their API 2 VEC vectors. Our empirical +evaluation shows that API 2 API relatively improves 22.6% and +40.1% top-1 and top-5 accuracy over a state-of-the-art mining +approach for API mappings. Finally, as another application in +code migration, we are able to migrate equivalent API usages +from Java to C# with up to 90.6% recall and 87.2% precision. +	API representation
2017	DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning + + + + +	Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim	IJCAI	Computer programs written in one language are often required to be ported to other languages to support multiple devices and environments. When programs use language specific APIs (Application Programming Interfaces), it is very challenging to migrate these APIs to the corresponding APIs written in other languages. Existing approaches mine API mappings from projects that have corresponding versions in two languages. They rely on the sparse availability of bilingual projects, thus producing a limited number of API mappings. In this paper, we propose an intelligent system called DeepAM for automatically mining API mappings from a large-scale code corpus without bilingual projects. The key component of DeepAM is based on the multimodal sequence to sequence learning architecture that aims to learn joint semantic representations of bilingual API sequences from big source code data. Experimental results indicate that DeepAM significantly increases the accuracy of API mappings as well as the number of API mappings, when compared with the state-of-the-art approaches. +	API
2017	Are Deep Neural Networks the Best Choice for Modeling Source Code? + + + + +	Vincent J. Hellendoorn, Premkumar Devanbu	FSE	Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source +code. We argue here that the special properties of source code can +be exploited for further improvements. In this work, we enhance +established language modeling approaches to handle the special +challenges of modeling source code, such as: frequent changes, +larger, changing vocabularies, deeply nested scopes, etc. We present +a fast, nested language modeling toolkit specifically designed for +software, with the ability to add & remove text, and mix & swap out +many models. Specifically, we improve upon prior cache-modeling +work and present a model with a much more expansive, multi-level +notion of locality that we show to be well-suited for modeling +software. We present results on varying corpora in comparison +with traditional N -gram, as well as RNN, and LSTM deep-learning +language models, and release all our source code for public use. +Our evaluations suggest that carefully adapting N-gram models for +source code can yield performance that surpasses even RNN and +LSTM based deep-learning models. +	language model
2017	Abridging Source Code + + + + +	Binhang Yuan, Vijayaraghavan Murali, Christopher Jermaine	OOPSLA	In this paper, we consider the problem of source code abridgment, where the goal is to remove statements from a source code in order to display the source code in a small space, while at the same time leaving the ``important’’ parts of the source code intact, so that an engineer can read the code and quickly understand purpose of the code. To this end, we develop an algorithm that looks at a number of examples, human-created source code abridgments, and learns how to remove lines from the code in order to mimic the human abridger. The learning algorithm takes into account syntactic features of the code, as well as semantic features such as control flow and data dependencies. Through a comprehensive user study, we show that the abridgments that our system produces can decrease the time that a user must look at code in order to understand its functionality, as well as increase the accuracy of the assessment, while displaying the code in a greatly reduced area. +	summarization
2017	Automatically Generating Commit Messages from Diffs using Neural Machine Translation + + + + +	Siyuan Jiang, Ameer Armaly, Collin McMillan	ASE	Commit messages are a valuable resource in comprehension of software evolution, since they provide a record of changes such as feature additions and bug repairs. Unfortunately, programmers often neglect to write good commit messages. Different techniques have been proposed to help programmers by automatically writing these messages. These techniques are effective at describing what changed, but are often verbose and lack context for understanding the rationale behind a change. In contrast, humans write messages that are short and summarize the high level rationale. In this paper, we adapt Neural Machine Translation (NMT) to automatically “translate” diffs into commit messages. We trained an NMT algorithm using a corpus of diffs and human-written commit messages from the top 1k Github projects. We designed a filter to help ensure that we only trained the algorithm on higher-quality commit messages. Our evaluation uncovered a pattern in which the messages we generate tend to be either very high or very low quality. Therefore, we created a quality-assurance filter to detect cases in which we are unable to produce good messages, and return a warning instead. +	edit bimodal
2017	pix2code: Generating Code from a Graphical User Interface Screenshot + + + + +	Tony Beltramelli		Transforming a graphical user interface screenshot created by a designer into computer code is a typical task conducted by a developer in order to build customized software, websites and mobile applications. In this paper, we show that Deep Learning techniques can be leveraged to automatically generate code given a graphical user interface screenshot as input. Our model is able to generate code targeting three different platforms (i.e. iOS, Android and web-based technologies) from a single input image with over 77% of accuracy. + +	code generation bimodal
2017	Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts + + + + +	Rohan Bavishi, Michael Pradel, Koushik Sen		Most of the JavaScript code deployed in the wild has been minified, a process in which identifier names are replaced +with short, arbitrary and meaningless names. Minified code occupies less space, but also makes the code extremely difficult to manually inspect and understand. This paper presents Context2Name, a deep learning-based technique that partially reverses the effect of minification by predicting natural +identifier names for minified names. The core idea is to predict from the usage context of a variable a name that captures +the meaning of the variable. The approach combines a lightweight, token-based static analysis with an auto-encoder +neural network that summarizes usage contexts and a recurrent neural network that predict natural names for a given +usage context. We evaluate Context2Name +with a large corpus of real-world JavaScript code and show that it successfully predicts 60.4% of all minified identifiers. A comparison +with the state-of-the-art tools JSNice and JSNaughty shows +that our approach predicts 17% and 43% more names than the +best existing approaches, while taking only 2.6 milliseconds +to predict a name, on average. +	naming
2017	DeepFix: Fixing Common C Language Errors by Deep Learning + + + + +	Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade	AAAI	The problem of automatically fixing programming errors is a +very active research topic in software engineering. This is a +challenging problem as fixing even a single error may require +analysis of the entire program. In practice, a number of errors +arise due to programmer’s inexperience with the programming language or lack of attention to detail. We call these +common programming errors. These are analogous to grammatical errors in natural languages. Compilers detect such errors, but their error messages are usually inaccurate. In this +work, we present an end-to-end solution, called DeepFix, that +can fix multiple such errors in a program without relying on +any external tool to locate or fix them. At the heart of DeepFix +is a multi-layered sequence-to-sequence neural network with +attention which is trained to predict erroneous program locations along with the required correct statements. On a set of +6971 erroneous C programs written by students for 93 programming tasks, DeepFix could fix 1881 (27%) programs +completely and 1338 (19%) programs partially. +	repair code generation
2017	Function Assistant: A Tool for NL Querying of APIs + + + + +	Kyle Richardson, Jonas Kuhn	EMNLP	In this paper, we describe Function Assistant, a lightweight Python-based toolkit for querying and exploring source code repositories using natural language. The toolkit is designed to help end-users of a target API quickly find information about functions through high-level natural language queries and descriptions. For a given text query and background API, the tool finds candidate functions by performing a translation from the text to known representations in the API using the semantic parsing approach of Richardson and Kuhn (2017). Translations are automatically learned from example text-code pairs in example APIs. The toolkit includes features for building translation pipelines and query engines for arbitrary source code projects. To explore this last feature, we perform new experiments on 27 well-known Python projects hosted on Github. +	bimodal API
2017	A parallel corpus of Python functions and documentation strings for automated code documentation and code generation + + + + +	Antonio Valerio Miceli Barone, Rico Sennrich		Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains. + + In this work we introduce a large and diverse parallel corpus of a hundred thousands Python functions with their documentation strings (“docstrings”) generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with +data augmentation techniques to further increase the amount of training data. + + We release our datasets and processing scripts in order to stimulate research in these areas. + +	documentation summarization dataset
2017	Neural Attribute Machines for Program Generation + + + + +	Matthew Amodio, Swarat Chaudhuri, Thomas W. Reps		Recurrent neural networks have achieved remarkable success at generating sequences with complex structures, thanks to advances that include richer embeddings of input and cures for vanishing gradients. Trained only on sequences from a known grammar, though, they can still struggle to learn rules and constraints of the grammar. Neural Attribute Machines (NAMs) are equipped with a logical machine that represents the underlying grammar, which is used to teach the constraints to the neural machine by (i) augmenting the input sequence, and (ii) optimizing a custom loss function. Unlike traditional RNNs, NAMs are exposed to the grammar, as well as samples from the language of the grammar. During generation, NAMs make significantly fewer violations of the constraints of the underlying grammar than RNNs trained only on samples from the language of the grammar. + +	grammar code generation representation
2017	Learning to Align the Source Code to the Compiled Object Code + + + + +	Dor Levy, Lior Wolf	ICML	We propose a new neural network architecture +and use it for the task of statement-by-statement +alignment of source code and its compiled object code. Our architecture learns the alignment +between the two sequences – one being the translation of the other – by mapping each statement +to a context-dependent representation vector and +aligning such vectors using a grid of the two sequence domains. Our experiments include short +C functions, both artificial and human-written, +and show that our neural network architecture +is able to predict the alignment with high accuracy, outperforming known baselines. We also +demonstrate that our model is general and can +learn to solve graph problems such as the Traveling Salesman Problem. +	decompilation
2017	Program Synthesis from Natural Language Using Recurrent Neural Networks + + + + +	Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Michael D. Ernst	Technical Report UW-CSE-17-03-01, University of Washington Department of Computer Science and Engineering	Oftentimes, a programmer may have difficulty implementing a +desired operation. Even when the programmer can describe her +goal in English, it can be difficult to translate into code. Existing +resources, such as question-and-answer websites, tabulate specific +operations that someone has wanted to perform in the past, but +they are not effective in generalizing to new tasks, to compound +tasks that require combining previous questions, or sometimes even +to variations of listed tasks. + + Our goal is to make programming easier and more productive by +letting programmers use their own words and concepts to express +the intended operation, rather than forcing them to accommodate +the machine by memorizing its grammar. We have built a system +that lets a programmer describe a desired operation in natural language, then automatically translates it to a programming language +for review and approval by the programmer. Our system, Tellina, +does the translation using recurrent neural networks (RNNs), a +state-of-the-art natural language processing technique that we augmented with slot (argument) filling and other enhancements. + + We evaluated Tellina in the context of shell scripting. We trained +Tellina’s RNNs on textual descriptions of file system operations +and bash one-liners, scraped from the web. Although recovering +completely correct commands is challenging, Tellina achieves top-3 +accuracy of 80% for producing the correct command structure. In a +controlled study, programmers who had access to Tellina outperformed those who did not, even when Tellina’s predictions were +not completely correct, to a statistically significant degree. +	bimodal code generation
2017	Autofolding for Source Code Summarization + + + + +	Jaroslav Fowkes, Razan Ranca, Miltiadis Allamanis, Mirella Lapata, Charles Sutton	TSE	Developers spend much of their time reading and browsing source code, raising new opportunities for summarization methods. Indeed, modern code editors provide code folding, which allows one to selectively hide blocks of code. However this is impractical to use as folding decisions must be made manually or based on simple rules. We introduce the +autofolding problem, which is to automatically create a code summary by folding less informative code regions. We present a novel solution by formulating the problem as a sequence of AST folding decisions, leveraging a scoped topic model for code tokens. On an annotated set of popular open source projects, we show that our summarizer outperforms simpler baselines, yielding a 28% error reduction. Furthermore, we find through a case study that our summarizer is strongly preferred by experienced developers. More broadly, we hope this work will aid program comprehension by turning code folding into a usable and valuable tool. +	summarization
2017	A Language Model for Statements of Software Code + + + + +	Yixiao Yang, Yu Jiang, Ming Gu, Jiaguang Sun, Jian Gao, Han Liu	ASE	Building language models for source code enables a large set of improvements on traditional software engineering tasks. One promising application is automatic code completion. State-of-the-art techniques capture code regularities at token level with lexical information. Such language models are more suitable for predicting short token sequences, but become less effective with respect to long statement level predictions. In this paper, we have proposed PCC to optimize the token level based language modeling. Specifically, PCC introduced an intermediate representation (IR) for source code, which puts tokens into groups using lexeme and variable relative order. In this way, PCC is able to handle long token sequences, i.e., group sequences, to suggest a complete statement with the precise synthesizer. Further more, PCC employed a fuzzy matching technique which combined genetic and longest common sub-sequence algorithms to make the prediction more accurate. We have implemented a code completion plugin for Eclipse and evaluated it on open-source Java projects. The results have demonstrated the potential of PCC in generating precise long statement level predictions. In 30%-60% of the cases, it can correctly suggest the complete statement with only six candidates, and 40%-90% of the cases with ten candidates. +	language model
2017	Mining Semantic Loop Idioms from Big Code + + + + +	Miltiadis Allamanis, Earl T. Barr, Christian Bird, Mark Marron, Charles Sutton	TSE	During maintenance, developers spend a lot of time transforming existing code: refactoring, optimizing, and adding checks to make it more robust. Much of this work is the drudgery of identifying and replacing specific patterns, yet it resists automation, because of meaningful patterns are hard to automatically find. We present a technique for mining loop idioms, surprisingly probable semantic patterns that occur in loops, from big code to find meaningful patterns. First, we show that automatically identifiable patterns exist, in great numbers, with a large scale empirical study of loop over 25 MLOC. We find that loops in this corpus are simple and predictable: 90% of them have fewer than 15LOC and 90% have no nesting and very simple control structure. Encouraged by this result, we coil loops to abstract away syntactic diversity to define information rich loop idioms. We show that only 50 loop idioms cover 50% of the concrete loops. We show how loop idioms can help a tool developers identify and prioritize refactorings. We also show how our framework opens the door to data-driven tool and language design discovering opportunities to introduce new API calls and language constructs: loop idioms show that LINQ would benefit from an Enumerate operator, a result confirmed by the fact that precisely this feature is one of the most requested features on StackOverflow with 197 votes and 95k views. +	pattern mining grammar
2017	SmartPaste: Learning to Adapt Source Code + + + + +	Miltiadis Allamanis, Marc Brockschmidt		Deep Neural Networks have been shown to succeed at a range of natural +language tasks such as machine translation and text summarization. +While tasks on source code (ie, formal languages) have been considered +recently, most work in this area does not attempt to capitalize on the +unique opportunities offered by its known syntax and structure. In this +work, we introduce SmartPaste, a first task that requires to use such +information. The task is a variant of the program repair problem that +requires to adapt a given (pasted) snippet of code to surrounding, +existing source code. As first solutions, we design a set of deep +neural models that learn to represent the context of each variable +location and variable usage in a data flow-sensitive way. Our +evaluation suggests that our models can learn to solve the SmartPaste +task in many cases, achieving 58.6% accuracy, while learning meaningful +representation of variable usages. +	representation variable misuse
2017	A Syntactic Neural Model for General-Purpose Code Generation + + + + +	Pengcheng Yin, Graham Neubig	ACL	We consider the problem of parsing natural language descriptions into source code +written in a general-purpose programming +language like Python. Existing data-driven methods treat this problem as a language generation task without considering +the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture +powered by a grammar model to explicitly +capture the target syntax as prior knowledge. Experiments find this an effective +way to scale up to generation of complex +programs from natural language descriptions, achieving state-of-the-art results that +well outperform previous code generation +and semantic parsing approaches. +	code generation grammar bimodal
2017	Code Completion with Neural Attention and Pointer Networks + + + + +	Jian Li, Yue Wang, Michael R. Lyu, Irwin King		Intelligent code completion has become an essential tool to accelerate modern software development. To facilitate effective code completion for dynamically-typed programming languages, we apply neural language models by learning from large codebases, and investigate the effectiveness of attention mechanism on the code completion task. However, standard neural language models even with attention mechanism cannot correctly predict out-of-vocabulary (OoV) words thus restrict the code completion performance. In this paper, inspired by the prevalence of locally repeated terms in program source code, and the recently proposed pointer networks which can reproduce words from local context, we propose a pointer mixture network for better predicting OoV words in code completion. Based on the context, the pointer mixture network learns to either generate a within-vocabulary word through an RNN component, or copy an OoV word from local context through a pointer component. Experiments on two benchmarked datasets demonstrate the effectiveness of our attention mechanism and pointer mixture network on the code completion task. + +	language model autocomplete
2017	The Code2Text Challenge: Text Generation in Source Code Libraries + + + + +	Kyle Richardson, Sina Zarrieß, Jonas Kuhn	INLG	We propose a new shared task for tactical data-to-text generation in the domain of source code libraries. Specifically, we focus on text generation of function descriptions from example software projects. Data is drawn from existing resources used for studying the related problem of semantic parser induction (Richardson and Kuhn, 2017b; Richardson and Kuhn, 2017a), and spans a wide variety of both natural languages and programming languages. In this paper, we describe these existing resources, which will serve as training and development data for the task, and discuss plans for building new independent test sets. +	bimodal
2017	Deep Learning to Find Bugs + + + + +	Michael Pradel, Koushik Sen		Automated bug detection, e.g., through pattern-based static +analysis, is an increasingly popular technique to find programming errors and other code quality issues. Traditionally, +bug detectors are program analyses that are manually written and carefully tuned by an analysis expert. Unfortunately, +the huge amount of possible bug patterns makes it difficult +to cover more than a small fraction of all bugs. This paper +presents a new approach toward creating bug detectors. The +basic idea is to replace manually writing a program analysis +with training a machine learning model that distinguishes +buggy from non-buggy code. To address the challenge that +effective learning requires both positive and negative train- +ing examples, we use simple code transformations that create likely incorrect code from existing code examples. We +present a general framework, called DeepBugs, that extracts +positive training examples from a code corpus, leverages +simple program transformations to create negative training +examples, trains a model to distinguish these two, and then +uses the trained model for identifying programming mistakes in previously unseen code. As a proof of concept, we +create four bug detectors for JavaScript that find a diverse set +of programming mistakes, e.g., accidentally swapped function arguments, incorrect assignments, and incorrect binary +operations. To find bugs, the trained models use information +that is usually discarded by program analyses, such as identifier names of variables and functions. Applying the approach +to a corpus of 150,000 JavaScript files shows that learned bug +detectors have a high accuracy, are very efficient, and reveal +132 programming mistakes in real-world code. + +	defect program analysis
2017	Abstract Syntax Networks for Code Generation and Semantic Parsing + + + + +	Maxim Rabinovich, Mitchell Stern, Dan Klein	ACL	Tasks like code generation and semantic parsing require mapping unstructured (or partially structured) inputs to well-formed, executable outputs. We introduce abstract syntax networks, a modeling framework for these problems. The outputs are represented as abstract syntax trees (ASTs) and constructed by a decoder with a dynamically-determined modular structure paralleling the structure of the output tree. On the benchmark Hearthstone dataset for code generation, our model obtains 79.2 BLEU and 22.7% exact match accuracy, compared to previous state-of-the-art values of 67.1 and 6.1%. Furthermore, we perform competitively on the Atis, Jobs, and Geo semantic parsing datasets with no task-specific engineering. +	code generation grammar
2017	A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes + + + + +	Pablo Loyola, Edison Marrese-Taylor, Yutaka Matsuo		We propose a model to automatically describe changes introduced in the source code of a program using natural language. Our method receives as input a set of code commits, which contains both the modifications and message introduced by an user. These two modalities are used to train an encoder-decoder architecture. We evaluated our approach on twelve real world open source projects from four different programming languages. Quantitative and qualitative results showed that the proposed approach can generate feasible and semantically sound descriptions not only in standard in-project settings, but also in a cross-project setting. +	edit summarization
2017	Learning a Classifier for False Positive Error Reports Emitted by Static Code Analysis Tools + + + + +	Ugur Koc, Parsa Saadatpanah, Jeffrey S. Foster, Adam A. Porter.	MAPL	The large scale and high complexity of modern software systems +make perfectly precise static code analysis (SCA) infeasible. Therefore SCA tools often over-approximate, so not to miss any real +problems. This, however, comes at the expense of raising false +alarms, which, in practice, reduces the usability of these tools. + + To partially address this problem, we propose a novel learning +process whose goal is to discover program structures that cause +a given SCA tool to emit false error reports, and then to use this +information to predict whether a new error report is likely to be a +false positive as well. To do this, we first preprocess code to isolate +the locations that are related to the error report. Then, we apply +machine learning techniques to the preprocessed code to discover +correlations and to learn a classifier. + + We evaluated this approach in an initial case study of a widely-used SCA tool for Java. Our results showed that for our dataset +we could accurately classify a large majority of false positive error +reports. Moreover, we identified some common coding patterns that +led to false positive errors. We believe that SCA developers may be +able to redesign their methods to address these patterns and reduce +false positive error reports. +	static analysis
2017	Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities + + + + +	Martin White, Michele Tufano, Matias Martinez, Martin Monperrus, Denys Poshyvanyk	SANER	In the field of automated program repair, the redundancy assumption claims large programs contain the seeds +of their own repair. However, most redundancy-based program +repair techniques do not reason about the repair ingredients—the code that is reused to craft a patch. We aim to reason about +the repair ingredients by using code similarities to prioritize and +transform statements in a codebase for patch generation. Our +approach, DeepRepair, relies on deep learning to reason about +code similarities. Code fragments at well-defined levels of granularity in a codebase can be sorted according to their similarity +to suspicious elements (i.e., code elements that contain suspicious +statements) and statements can be transformed by mapping out-of-scope identifiers to similar identifiers in scope. We examined +these new search strategies for patch generation with respect to +effectiveness from the viewpoint of a software maintainer. Our +comparative experiments were executed on six open-source Java +projects including 374 buggy program revisions and consisted +of 19,949 trials spanning 2,616 days of computation time. DeepRepair’s search strategy using code similarities generally found +compilable ingredients faster than the baseline, jGenProg, but +this improvement neither yielded test-adequate patches in fewer +attempts (on average) nor found significantly more patches than +the baseline. Although the patch counts were not statistically +different, there were notable differences between the nature of +DeepRepair patches and baseline patches. The results demonstrate that our learning-based approach finds patches that cannot +be found by existing redundancy-based repair techniques +	repair
2017	Synthesizing benchmarks for predictive modeling + + + + +	Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather	CGO	Predictive modeling using machine learning is an effective method for building compiler heuristics, but there is a shortage of benchmarks. Typical machine learning experiments outside of the compilation field train over thousands or millions of examples. In machine learning for compilers, however, there are typically only a few dozen common benchmarks available. This limits the quality of learned models, as they have very sparse training data for what are often high-dimensional feature spaces. What is needed is a way to generate an unbounded number of training programs that finely cover the feature space. At the same time the generated programs must be similar to the types of programs that human developers actually write, otherwise the learning will target the wrong parts of the feature space. We mine open source repositories for program fragments and apply deep learning techniques to automatically construct models for how humans write programs. We sample these models to generate an unbounded number of runnable training programs. The quality of the programs is such that even human developers struggle to distinguish our generated programs from hand-written code. We use our generator for OpenCL programs, CLgen, to automatically synthesize thousands of programs and show that learning over these improves the performance of a state of the art predictive model by 1.27x. In addition, the fine covering of the feature space automatically exposes weaknesses in the feature design which are invisible with the sparse training examples from existing benchmark suites. Correcting these weaknesses further increases performance by 4.30x. +	optimization code generation
2017	Topic modeling of public repositories at scale using names in source code + + + + +	Vadim Markovtsev, Eiso Kant		Programming languages themselves have a limited number of reserved keywords and character based tokens that +define the language specification. However, programmers have a rich use of natural language within their code +through comments, text literals and naming entities. The programmer defined names that can be found in source +code are a rich source of information to build a high level understanding of the project. The goal of this paper +is to apply topic modeling to names used in over 13.6 million repositories and perceive the inferred topics. +One of the problems in such a study is the occurrence of duplicate repositories not officially marked as forks (obscure forks). +We show how to address it using the same identifiers which are extracted for topic modeling. + + We open with a discussion on naming in source code, we then elaborate on our approach to remove exact duplicate +and fuzzy duplicate repositories using Locality Sensitive Hashing on the bag-of-words model and then discuss our work +on topic modeling; and finally present the results from our data analysis together with open-access to the source code, +tools and datasets. +	topic modeling pattern mining
2017	Recovering Clear, Natural Identifiers from Obfuscated JS Names + + + + +	Bogdan Vasilescu, Casey Casalnuovo, Premkumar Devanbu	FSE	Well-chosen variable names are critical to source code readability, reusability, and maintainability. Unfortunately, in deployed JavaScript code (which is ubiquitous on the web) the identifier names are frequently minified and overloaded. This is done both for efficiency and also to protect potentially proprietary intellectual property. In this paper, we describe an approach based on statistical machine translation (SMT) that recovers some of the original names from the JavaScript programs minified by the very popular UglifyJS. This simple tool, Autonym, performs comparably to the best currently available deobfuscator for JavaScript, JSNice, which uses sophisticated static analysis. In fact, Autonym is quite complementary to JSNice, performing well when it does not, and vice versa. We also introduce a new tool, JSNaughty, which blends Autonym and JSNice, and significantly outperforms both at identifier name recovery, while remaining just as easy to use as JSNice. JSNaughty is available online at http://jsnaughty.org. +	deobfuscation naming
2017	End-to-end Deep Learning of Optimization Heuristics + + + + +	Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather		Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand crafted by developers through a combination of expert domain knowledge and trial and error. This makes the quality of the final model directly dependent on the skill and available time of the system architect. + + Our work introduces a better way for building heuristics. We develop a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation. Further, we show that our neural nets can transfer learning from one optimization problem to another, improving the accuracy of new models, without the help of human experts. + + We compare the effectiveness of our automatically generated heuristics against ones with features hand-picked by experts. We examine two challenging tasks: predicting optimal mapping for heterogeneous parallelism and GPU thread coarsening factors. In 89% of the cases, the quality of our fully automatic heuristics matches or surpasses that of state-of-the-art predictive models using hand-crafted features, providing on average 14% and 12% more performance with no human effort expended on designing features. +	optimization
2017	CodeSum: Translate Program Language to Natural Language + + + + +	Xing Hu, Yuhan Wei, Ge Li, Zhi Jin		During software maintenance, programmers spend a lot of time on code comprehension. Reading comments is an effective way for programmers to reduce the reading and navigating time when comprehending source code. Therefore, as a critical task in software engineering, code summarization aims to generate brief natural language descriptions for source code. In this paper, we propose a new code summarization model named CodeSum. CodeSum exploits the attention-based sequence-to-sequence (Seq2Seq) neural network with Structure-based Traversal (SBT) of Abstract Syntax Trees (AST). The AST sequences generated by SBT can better present the structure of ASTs and keep unambiguous. We conduct experiments on three large-scale corpora in different program languages, i.e., Java, C#, and SQL, in which Java corpus is our new proposed industry code extracted from Github. Experimental results show that our method CodeSum outperforms the state-of-the-art significantly. +	bimodal summarization
2017	Software Defect Prediction via Convolutional Neural Network + + + + +	Jian Li, Pinjia He, Jieming Zhu, Michael R. Lyu	QRS	To improve software reliability, software defect prediction is utilized to assist developers in finding potential bugs +and allocating their testing efforts. Traditional defect prediction +studies mainly focus on designing hand-crafted features, which +are input into machine learning classifiers to identify defective +code. However, these hand-crafted features often fail to capture +the semantic and structural information of programs. Such +information is important in modeling program functionality and +can lead to more accurate defect prediction. +In this paper, we propose a framework called Defect Prediction +via Convolutional Neural Network (DP-CNN), which leverages +deep learning for effective feature generation. Specifically, based +on the programs’ Abstract Syntax Trees (ASTs), we first extract +token vectors, which are then encoded as numerical vectors +via mapping and word embedding. We feed the numerical +vectors into Convolutional Neural Network to automatically +learn semantic and structural features of programs. After that, +we combine the learned features with traditional hand-crafted +features, for accurate software defect prediction. We evaluate our +method on seven open source projects in terms of F-measure in +defect prediction. The experimental results show that in average, +DP-CNN improves the state-of-the-art method by 12%. + +	defect
2017	Learning Technical Correspondences in Technical Documentation + + + + +	Kyle Richardson, Jonas Kuhn	ACL	We consider the problem of translating high-level textual descriptions to formal representations in technical documentation as part of an effort to model the meaning of such documentation. We focus specifically on the problem of learning translational correspondences between text descriptions and grounded representations in the target documentation, such as formal representation of functions or code templates. Our approach exploits the parallel nature of such documentation, or the tight coupling between high-level text and the low-level representations we aim to learn. Data is collected by mining technical documents for such parallel text-representation pairs, which we use to train a simple semantic parsing model. We report new baseline results on sixteen novel datasets, including the standard library documentation for nine popular programming languages across seven natural languages, and a small collection of Unix utility manuals. +	documentation API bimodal
2017	Semantic Code Repair using Neuro-Symbolic Transformation Networks + + + + +	Jacob Devlin, Jonathan Uesato, Rishabh Singh, Pushmeet Kohli		We study the problem of semantic code repair, which can be broadly defined as automatically fixing +non-syntactic bugs in source code. The majority of past work in semantic code repair assumed access +to unit tests against which candidate repairs could be validated. In contrast, the goal here is to +develop a strong statistical model to accurately predict both bug locations and exact fixes without +access to information about the intended correct behavior of the program. Achieving such a goal +requires a robust contextual repair model, which we train on a large corpus of real-world source +code that has been augmented with synthetically injected bugs. Our framework adopts a two-stage +approach where first a large set of repair candidates are generated by rule-based processors, and +then these candidates are scored by a statistical model using a novel neural network architecture +which we refer to as Share, Specialize, and Compete. Specifically, the architecture (1) generates +a shared encoding of the source code using an RNN over the abstract syntax tree, +(2) scores each candidate repair using specialized network modules, and (3) then normalizes these +scores together so they can compete against one another in comparable probability space. We evaluate +our model on a real-world test set gathered from GitHub containing four common categories of bugs. +Our model is able to predict the exact correct repair 41% of the time with a single guess, compared +to 13% accuracy for an attentional sequence-to-sequence model. +	repair
2017	Semantically enhanced software traceability using deep learning techniques + + + + +	Jin Guo, Jinghui Cheng, Jane Cleland-Huang	ICSE	In most safety-critical domains the need for traceability is prescribed by certifying bodies. Trace links are generally created among requirements, design, source code, test cases and other artifacts; however, creating such links manually is time consuming and error prone. Automated solutions use information retrieval and machine learning techniques to generate trace links; however, current techniques fail to understand semantics of the software artifacts or to integrate domain knowledge into the tracing process and therefore tend to deliver imprecise and inaccurate results. In this paper, we present a solution that uses deep learning to incorporate requirements artifact semantics and domain knowledge into the tracing solution. We propose a tracing network architecture that utilizes Word Embedding and Recurrent Neural Network (RNN) models to generate trace links. Word embedding learns word vectors that represent knowledge of the domain corpus and RNN uses these word vectors to learn the sentence semantics of requirements artifacts. We trained 360 different configurations of the tracing network using existing trace links in the Positive Train Control domain and identified the Bidirectional Gated Recurrent Unit (BI-GRU) as the best model for the tracing task. BI-GRU significantly out-performed state-of-the-art tracing methods including the Vector Space Model and Latent Semantic Indexing. +	traceability representation
2016	A deep language model for software code + + + + +	Hoa Khanh Dam, Truyen Tran, Trang Pham		Existing language models such as n-grams for software code often fail to capture a long context where dependent code elements scatter far apart. In this paper, we propose a novel approach to build a language model for software code to address this particular issue. Our language model, partly inspired by human memory, is built upon the powerful deep learning-based Long Short Term Memory architecture that is capable of learning long-term dependencies which occur frequently in software code. Results from our intrinsic evaluation on a corpus of Java projects have demonstrated the effectiveness of our language model. This work contributes to realizing our vision for DeepSoft, an end-to-end, generic deep learning-based framework for modeling software and its development process. +	language model code generation
2016	Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic, Generative Models of Input Data + + + + +	Jibesh Patra, Michael Pradel		Fuzzing is a popular technique to create test inputs for software that processes structured data. It has been successfully +applied in various domains, ranging from compilers and interpreters over program analyses to rendering engines, image manipulation tools, and word processors. Existing fuzz +testing techniques are tailored for a particular purpose and +rely on a carefully crafted model of the data to be generated. +This paper presents TreeFuzz, a generic approach for generating structured data without an a priori known model. The +key idea is to exploit a given corpus of example data to au- +tomatically infer probabilistic, generative models that create +new data with properties similar to the corpus. To support a +wide range of different properties, TreeFuzz is designed as a +framework with an extensible set of techniques to infer generative models. We apply the idea to JavaScript programs +and HTML documents and show that the approach generates mostly valid data for both of them: 96.3% of the generated JavaScript programs are syntactically valid and there are +only 2.06 validation errors per kilobyte of generated HTML. +The performance of both learning and generation scales linearly w.r.t. the size of the corpus. Using TreeFuzz-generated +JavaScript programs for differential testing of JavaScript engines exposes various inconsistencies among browsers, including browser bugs and unimplemented language features. +	fuzzing
2016	Parameter-Free Probabilistic API Mining across GitHub + + + + +	Jaroslav Fowkes, Charles Sutton	FSE	Existing API mining algorithms can be difficult to use as they require expensive parameter tuning and the returned set of API calls can be large, highly redundant and difficult to understand. To address this, we present PAM (Probabilistic API Miner), a near parameter-free probabilistic algorithm for mining the most interesting API call patterns. We show that PAM significantly outperforms both MAPO and UPMiner, achieving 69% test-set precision, at retrieving relevant API call sequences from GitHub. Moreover, we focus on libraries for which the developers have explicitly provided code examples, yielding over 300,000 LOC of hand-written API example code from the 967 client projects in the data set. This evaluation suggests that the hand-written examples actually have limited coverage of real API usages. + +	API pattern mining
2016	Question Independent Grading using Machine Learning: The Case of Computer Program Grading + + + + +	Gursimran Singh, Shashank Srikant, Varun Aggarwal	KDD	Learning supervised models to grade open-ended responses is an expensive process. A model has to be trained for every prompt/question separately, which in turn requires graded samples. In automatic programming evaluation specifically, the focus of this work, this issue is amplified. The models have to be trained not only for every question but also for every language the question is offered in. Moreover, the availability and time taken by experts to create a labeled set of programs for each question is a major bottleneck in scaling such a system. We address this issue by presenting a method to grade computer programs which requires no manually assigned labeled samples for grading responses to a new, unseen question. We extend our previous work (by Srikant, Aggarwal; KDD 2014) wherein we introduced a grammar of features to learn question specific models. In this work, we propose a method to transform those features into a set of features that maintain their structural relation with the labels across questions. Using these features we learn one supervised model, across questions for a given language, which can then be applied to an ungraded response to an unseen question. We show that our method rivals the performance of both, question specific models and the consensus among human experts while substantially outperforming extant ways of evaluating codes. We demonstrate the system single s value by deploying it to grade programs in a high stakes assessment. The learning from this work is transferable to other grading tasks such as math question grading and also provides a new variation to the supervised learning approach. +	education
2016	Extracting Code from Programming Tutorial Videos + + + + +	Shir Yadid, Eran Yahav	Onward!	The number of programming tutorial videos on the web +increases daily. Video hosting sites such as YouTube host +millions of video lectures, with many programming tutorials for various languages and platforms. These videos contain a wealth of valuable information, including code that +may be of interest. However, two main challenges have so +far prevented the effective indexing of programming tutorial +videos: (i) code in tutorials is typically written on-the-fly, +with only parts of the code visible in each frame, and (ii) optical character recognition (OCR) is not precise enough to +produce quality results from videos. + + We present a novel approach for extracting code from +videos that is based on: (i) consolidating code across frames, +and (ii) statistical language models for applying corrections +at different levels, allowing us to make corrections by choosing the most likely token, combination of tokens that form a +likely line structure, and combination of lines that lead to +a likely code fragment in a particular language. We implemented our approach in a tool called ACE , and used it to extract code from 40 Android video tutorials on YouTube . Our +evaluation shows that ACE extracts code with high accuracy, +enabling deep indexing of video tutorials. +	information extraction
2016	Towards Better Program Obfuscation: Optimization via Language Models + + + + +	Han Liu	ICSE	As a common practice in software development, program +obfuscation aims at deterring reverse engineering and malicious attacks on released source or binary code. Owning ample obfuscation techniques, we have relatively little +knowledge on how to most effectively use them. The biggest +challenge lies in identifying the most useful combination of +these techniques. We propose a unified framework to automatically generate and optimize obfuscation based on an +obscurity language model and a Monte Carlo Markov Chain +(MCMC) based search algorithm. We further instantiate it +for JavaScript programs and developed the Closure tool. +Compared to the well-known Google Closure Compiler, Closure outperforms its default setting by 26%. For programs +which have already been well obfuscated, Closure can still +outperform by 22%. +	deobfuscation
2016	Automatically Learning Semantic Features for Defect Prediction + + + + +	Song Wang, Taiyue Liu, Lin Tan	ICSE	Software defect prediction, which predicts defective code regions, can help developers find bugs and prioritize their testing efforts. To build accurate prediction models, previous +studies focus on manually designing features that encode the +characteristics of programs and exploring different machine +learning algorithms. Existing traditional features often fail +to capture the semantic differences of programs, and such a +capability is needed for building accurate prediction models. + + To bridge the gap between programs’ semantics and +defect prediction features, this paper proposes to leverage a +powerful representation-learning algorithm, deep learning, +to learn semantic representation of programs automatically +from source code. Specifically, we leverage Deep Belief +Network (DBN) to automatically learn semantic features +from token vectors extracted from programs’ Abstract +Syntax Trees (ASTs). + + Our evaluation on ten open source projects shows that +our automatically learned semantic features significantly improve both within-project defect prediction (WPDP) and +cross-project defect prediction (CPDP) compared to traditional features. Our semantic features improve WPDP on +average by 14.7% in precision, 11.5% in recall, and 14.2% +in F1. For CPDP, our semantic features based approach +outperforms the state-of-the-art technique TCA+ with traditional features by 8.9% in F1. +	defect representation
2016	Learning API Usages from Bytecode: A Statistical Approach + + + + +	Tam The Nguyen, Hung Viet Pham, Phong Minh Vu, Tung Thanh Nguyen	ICSE	Mobile app developers rely heavily on standard API frameworks and libraries. However, learning API usages is often challenging due to the fast-changing nature of API frameworks for mobile systems and the insufficiency of API documentation and source code examples. In this paper, we propose a novel approach to learn API usages from bytecode of Android mobile apps. Our core contributions include HAPI, a statistical model of API usages and three algorithms to extract method call sequences from apps’ bytecode, to train HAPI based on those sequences, and to recommend method calls in code completion using the trained HAPIs. Our empirical evaluation shows that our prototype tool can effectively learn API usages from 200 thousand apps containing 350 million method sequences. It recommends next method calls with top-3 accuracy of 90% and outperforms baseline approaches on average 10-20%. +	representation API
2016	Statistical Deobfuscation of Android Applications + + + + +	Benjamin Bichsel, Veselin Raychev, Petar Tsankov, Martin Vechev	CCS	This work presents a new approach for deobfuscating Android APKs based on probabilistic learning of large code bases (termed “Big Code”). The key idea is to learn a probabilistic model over thousands of non-obfuscated Android applications and to use this probabilistic model to deobfuscate new, unseen Android APKs. The concrete focus of the paper is on reversing layout obfuscation, a popular transformation which renames key program elements such as classes, packages, and methods, thus making it difficult to understand what the program does. Concretely, the paper: (i) phrases the layout deobfuscation problem of Android APKs as structured prediction in a probabilistic graphical model, (ii) instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy, and (iii) shows how to leverage powerful inference and learning algorithms to achieve overall precision and scalability of the probabilistic predictions. + + We implemented our approach in a tool called DeGuard and used it to: (i) reverse the layout obfuscation performed by the popular ProGuard system on benign, open-source applications, (ii) predict third-party libraries imported by benign APKs (also obfuscated by ProGuard), and (iii) rename obfuscated program elements of Android malware. The experimental results indicate that DeGuard is practically effective: it recovers 79.1% of the program element names obfuscated with ProGuard, it predicts third-party libraries with accuracy of 91.3%, and it reveals string decoders and classes that handle sensitive data in Android malware. + +	deobfuscation naming
2016	Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks + + + + +	Sahil Bhatia, Rishabh Singh		We present a method for automatically generating repair feedback for syntax errors for introductory programming problems. Syntax errors constitute one of the largest classes of errors (34%) in our dataset of student submissions obtained from a MOOC course on edX. The previous techniques for generating automated feedback on programming assignments have focused on functional correctness and style considerations of student programs. These techniques analyze the program AST of the program and then perform some dynamic and symbolic analyses to compute repair feedback. Unfortunately, it is not possible to generate ASTs for student programs with syntax errors and therefore the previous feedback techniques are not applicable in repairing syntax errors. We present a technique for providing feedback on syntax errors that uses Recurrent neural networks (RNNs) to model syntactically valid token sequences. Our approach is inspired from the recent work on learning language models from Big Code (large code corpus). For a given programming assignment, we first learn an RNN to model all valid token sequences using the set of syntactically correct student submissions. Then, for a student submission with +syntax errors, we query the learnt RNN model with the prefix token sequence to predict token sequences that can fix the error by either replacing or inserting the predicted token sequence at the error location. We evaluate our technique on over 14, 000 student submissions with syntax errors. Our technique can completely repair 31.69% (4501/14203) of submissions with syntax errors and in addition partially correct 6.39% (908/14203) of the submissions. +	repair
2016	Learning Python Code Suggestion with a Sparse Pointer Network + + + + +	Avishkar Bhoopchand, Tim Rocktaschel, Earl Barr, Sebastian Riedel		To enhance developer productivity, all modern integrated development environments (IDEs) include code suggestion functionality that proposes likely next tokens at the cursor. While current IDEs work well for statically-typed languages, their reliance on type annotations means that they do not provide the same level of support for dynamic programming languages as for statically-typed languages. Moreover, suggestion engines in modern IDEs do not propose expressions or multi-statement idiomatic code. Recent work has shown that language models can improve code suggestion systems by learning from software repositories. This paper introduces a neural language model with a sparse pointer network aimed at capturing very long-range dependencies. We release a large-scale code suggestion corpus of 41M lines of Python code crawled from GitHub. On this corpus, we found standard neural language models to perform well at suggesting local phenomena, but struggle to refer to identifiers that are introduced many tokens in the past. By augmenting a neural language model with a pointer network specialized in referring to predefined classes of identifiers, we obtain a much lower perplexity and a 5 percentage points increase in accuracy for code suggestion compared to an LSTM baseline. In fact, this increase in code suggestion accuracy is due to a 13 times more accurate prediction of identifiers. Furthermore, a qualitative analysis shows this model indeed captures interesting long-range dependencies, like referring to a class member defined over 60 tokens in the past. +	language model autocomplete
2016	sk_p: a neural program corrector for MOOCs + + + + +	Yewen Pu, Karthik Narasimhan, Armando Solar-Lezama, Regina Barzilay	SPLASH	We present a novel technique for automatic program correction in MOOCs, capable of fixing both syntactic and semantic errors without manual, problem specific correction strategies. Given an incorrect student program, it generates candidate programs from a distribution of likely corrections, and checks each candidate for correctness against a test suite. + + The key observation is that in MOOCs many programs share similar code fragments, and the seq2seq neural network model, used in the natural-language processing task of machine translation, can be modified and trained to recover these fragments. + + Experiment shows our scheme can correct 29% of all incorrect submissions and out-performs state of the art approach which requires manual, problem specific correction strategies. +	repair
2016	PHOG: Probabilistic Model for Code + + + + +	Pavol Bielik, Veselin Raychev, Martin Vechev	ICML	We introduce a new generative model for code called probabilistic higher order grammar (PHOG). PHOG generalizes probabilistic context free grammars (PCFGs) by allowing conditioning of a production rule beyond the parent non-terminal, thus capturing rich contexts relevant to programs. Even though PHOG is more powerful than a PCFG, it can be learned from data just as efficiently. We trained a PHOG model on a large JavaScript code corpus and show that it is more precise than existing models, while similarly fast. As a result, PHOG can immediately benefit existing programming tools based on probabilistic models of code. +	grammar code generation language model
2016	Automatically generating features for learning program analysis heuristics + + + + +	Kwonsoo Chae, Hakjoo Oh, Kihong Heo, Hongseok Yang		We present a technique for automatically generating features for data-driven program analyses. Recently data-driven approaches for building a program analysis have been proposed, which mine existing codebases and automatically learn heuristics for finding a cost-effective abstraction for a given analysis task. Such approaches reduce the burden of the analysis designers, but they do not remove it completely; they still leave the highly nontrivial task of designing so called features to the hands of the designers. Our technique automates this feature design process. The idea is to use programs as features after reducing and abstracting them. Our technique goes through selected program-query pairs in codebases, and it reduces and abstracts the program in each pair to a few lines of code, while ensuring that the analysis behaves similarly for the original and the new programs with respect to the query. Each reduced program serves as a boolean feature for program-query pairs. This feature evaluates to true for a given program-query pair when (as a program) it is included in the program part of the pair. We have implemented our approach for three real-world program analyses. Our experimental evaluation shows that these analyses with automatically-generated features perform comparably to those with manually crafted features. +	representation
2016	Neural Code Completion + + + + +	Chang Liu, Xin Wang, Richard Shin, Joseph E. Gonzalez, Dawn Song		Code completion, an essential part of modern software development, yet can be +challenging for dynamically typed programming languages. In this paper we explore the use of neural network techniques to automatically learn code completion +from a large corpus of dynamically typed JavaScript code. We show different +neural networks that leverage not only token level information but also structural +information, and evaluate their performance on different prediction tasks. We +demonstrate that our models can outperform the state-of-the-art approach, which +is based on decision tree techniques, on both next non-terminal and next terminal +prediction tasks by 3.8 points and 0.5 points respectively. We believe that neural +network techniques can play a transformative role in helping software developers +manage the growing complexity of software systems, and we see this work as a +first step in that direction. +	autocomplete
2016	Deep API Learning + + + + +	Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim.	FSE	Developers often wonder how to implement a certain functionality (e.g., how to parse XML files) using APIs. Obtaining an API usage sequence based on an API-related natural language query is very helpful in this regard. Given a query, existing approaches utilize information retrieval models to search for matching API sequences. These approaches treat queries and APIs as bag-of-words (i.e., keyword matching or word-to-word alignment) and lack a deep understanding of the semantics of the query. + + We propose DeepAPI, a deep learning based approach to generate API usage sequences for a given natural language query. Instead of a bags-of-words assumption, it learns the +sequence of words in a query and the sequence of associated APIs. DeepAPI adapts a neural language model named RNN Encoder-Decoder. It encodes a word sequence (user query) into a fixed-length context vector, and generates an API sequence based on the context vector. We also augment the RNN Encoder-Decoder by considering the importance of individual APIs. We empirically evaluate our approach with more than 7 million annotated code snippets collected from GitHub. The results show that our approach generates largely accurate API sequences and outperforms the related approaches. + +	API search
2016	Bugram: bug detection with n-gram language models + + + + +	Song Wang, Devin Chollak, Dana Movshovitz-Attias, Lin Tan	ASE	To improve software reliability, many rule-based techniques have been proposed to infer programming rules and detect violations of these rules as bugs. These rule-based approaches often rely on the highly frequent appearances of certain patterns in a project to infer rules. It is known that if a pattern does not appear frequently enough, rules are not learned, thus missing many bugs. + + In this paper, we propose a new approach—Bugram—that leverages n-gram language models instead of rules to detect bugs. Bugram models program tokens sequentially, using the n-gram language model. Token sequences from the program are then assessed according to their probability in the learned model, and low probability sequences are marked as potential bugs. The assumption is that low probability token sequences in a program are unusual, which may indicate bugs, bad practices, or unusual/special uses of code of which developers may want to be aware. + + We evaluate Bugram in two ways. First, we apply Bugram on the latest versions of 16 open source Java projects. Results show that Bugram detects 59 bugs, 42 of which are manually verified as correct, 25 of which are true bugs and 17 are code snippets that should be refactored. Among the 25 true bugs, 23 cannot be detected by PR-Miner. We have reported these bugs to developers, 7 of which have already been confirmed by developers (4 of them have already been fixed), while the rest await confirmation. Second, we further compare Bugram with three additional graph- and rule-based bug detection tools, i.e., JADET, Tikanga, and GrouMiner. We apply Bugram on 14 Java projects evaluated in these three studies. Bugram detects 21 true bugs, at least 10 of which cannot be detected by these three tools. Our results suggest that Bugram is complementary to existing rule-based bug detection approaches. + +	defect representation
2016	Summarizing Source Code using a Neural Attention Model + + + + +	Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer	ACL	High quality source code is often paired +with high level summaries of the computation it performs, for example in code +documentation or in descriptions posted +in online forums. Such summaries are +extremely useful for applications such as +code search but are expensive to manually +author, hence only done for a small fraction of all code that is produced. In this +paper, we present the first completely data-driven approach for generating high level +summaries of source code. Our model, +CODE-NN , uses Long Short Term Memory (LSTM) networks with attention to +produce sentences that describe C# code +snippets and SQL queries. CODE-NN +is trained on a new corpus that is automatically collected from StackOverflow, +which we release. Experiments demonstrate strong performance on two tasks: +(1) code summarization, where we establish the first end-to-end learning results +and outperform strong baselines, and (2) +code retrieval, where our learned model +improves the state of the art on a recently +introduced C# benchmark by a large margin. +	summarization bimodal
2016	Latent Predictor Networks for Code Generation + + + + +	Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom	ACL	Many language generation tasks require +the production of text conditioned on both +structured and unstructured inputs. +We present a novel neural network architecture which generates an output sequence +conditioned on an arbitrary number of input functions. +Crucially, our approach +allows both the choice of conditioning +context and the granularity of generation, +for example characters or tokens, to be +marginalised, thus permitting scalable and +effective training. Using this framework, +we address the problem of generating programming code from a mixed natural language and structured specification. +We create two new data sets for this paradigm +derived from the collectible trading card +games Magic the Gathering and Hearthstone. On these, and a third preexisting +corpus, we demonstrate that marginalising multiple predictors allows our model +to outperform strong benchmarks. + +	bimodal code generation
2016	A Convolutional Attention Network for Extreme Summarization of Source Code + + + + +	Miltiadis Allamanis, Hao Peng, Charles Sutton	ICML	Attention mechanisms in neural networks have proved useful for problems in which +the input and output do not have fixed dimension. Often there exist features that +are locally translation invariant and would be valuable for directing the model’s attention, +but previous attentional architectures are not constructed to learn such features specifically. +We introduce an attentional neural network that employs convolution on the input tokens to detect +local time-invariant and long-range topical attention features in a context-dependent way. We +apply this architecture to the problem of extreme summarization of source code snippets into short, +descriptive function name-like summaries. Using those features, the model sequentially generates a +summary by marginalizing over two attention mechanisms: one that predicts the next summary token based +n the attention weights of the input tokens and another that is able to copy a code token as-is directly +into the summary. We demonstrate our convolutional attention neural network’s performance on 10 popular Java +projects showing that it achieves better performance compared to previous attentional mechanisms. +	naming summarization
2016	Convolutional Neural Networks over Tree Structures for Programming Language Processing + + + + +	Lili Mou, Ge Li, Lu Zhang, Tao Wang, Zhi Jin	AAAI	Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the +artificial intelligence community. However, different from a +natural language sentence, a program contains rich, explicit, +and complicated structural information. Hence, traditional +NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in +which a convolution kernel is designed over programs’ abstract syntax trees to capture structural information. TBCNN +is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according +to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP. +	representation grammar
2016	Gated Graph Sequence Neural Networks + + + + +	Yujia Li, Daniel Tarlow, Marc Brockschmidt, Richard Zemel	ICLR	Graph-structured data appears frequently in domains including chemistry, natural +language semantics, social networks, and knowledge bases. In this work, we study +feature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify +to use gated recurrent units and modern optimization techniques and then extend +to output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based +models (e.g., LSTMs) when the problem is graph-structured. We demonstrate the +capabilities on some simple AI (bAbI) and graph algorithm learning tasks. We +then show it achieves state-of-the-art performance on a problem from program +verification, in which subgraphs need to be described as abstract data structures. + +	GNN program analysis
2016	Mapping API Elements for Code Migration with Vector Representations + + + + +	Trong Duc Nguyen, Anh Tuan Nguyen, Tien N. Nguyen	ICSE	Mapping API elements has a significant role in software development, especially in code migration. A manual process of defining the migration is tedious and error-prone while recent approaches to automatically mine API mappings are limited to discover the mappings with textually similar APIs’ names. This leads to the low accuracy in existing migration tools.We propose an approach to automatically mine API mappings which overcomes the lexical mismatch problem. We represent an API by its usages instead of its name.To characterize an API with its context consisting of surrounding APIs in its usages, we take advantage of Word2Vec model to project the APIs of Java JDK and C# .NET into corresponding continuous vector spaces. The semantic relations among APIs will be observed in those continuous space as the geometric arrangements between their representation vectors in two vector spaces.We use a learning approach to derive the linear (e.g., rotating and scaling) transformation function between two vector spaces. Transformation function is trained from human-defined pairs of API mappings from Java to C#. To find the C# API mapping with a given Java API, we use the learned function to compute its transformed vector in the C# vector space. Then, the C# API which has the most similar vector with the transformed vector is considered as the result. Our experiment shows that for just one suggestion, we are able to correctly derive the API in C# in almost 43% of the cases. With 5 suggestions, we can correctly suggest the correct C# API in almost 3 out of 4 cases (73.2%). +	migration API
2016	Learning Programs from Noisy Data + + + + +	Veselin Raychev, Pavol lBielik, Martin Vechev, Andreas Krause	POPL	We present a new approach for learning programs from noisy +datasets. Our approach is based on two new concepts: a regularized +program generator which produces a candidate program based on a +small sample of the entire dataset while avoiding overfitting, and a +dataset sampler which carefully samples the dataset by leveraging +the candidate program’s score on that dataset. The two components +are connected in a continuous feedback-directed loop. + + We show how to apply this approach to two settings: one where +the dataset has a bound on the noise, and another without a noise +bound. The second setting leads to a new way of performing +approximate empirical risk minimization on hypotheses classes +formed by a discrete search space. + + We then present two new kinds of program synthesizers which +target the two noise settings. First, we introduce a novel regularized +bitstream synthesizer that successfully generates programs even in +the presence of incorrect examples. We show that the synthesizer +can detect errors in the examples while combating overfitting – +a major problem in existing synthesis techniques. We also show +how the approach can be used in a setting where the dataset grows +dynamically via new examples (e.g., provided by a human). + + Second, we present a novel technique for constructing statistical +code completion systems. These are systems trained on massive +datasets of open source programs, also known as “Big Code”. The +key idea is to introduce a domain specific language (DSL) over +trees and to learn functions in that DSL directly from the dataset. +These learned functions then condition the predictions made by the +system. This is a flexible and powerful technique which generalizes +several existing works as we no longer need to decide a priori on +what the prediction should be conditioned (another benefit is that +the learned functions are a natural mechanism for explaining the +prediction). As a result, our code completion system surpasses the +prediction capabilities of existing, hard-wired systems. +	code generation grammar
2016	Deep Learning Code Fragments for Code Clone Detection + + + + +	Martin White, Michele Tufano, Christopher Vendome, Denys Poshyvanyk.	ASE	Code clone detection is an important problem for software +maintenance and evolution. Many approaches consider either structure or identifiers, but none of the existing detection techniques model both sources of information. These +techniques also depend on generic, handcrafted features to +represent code fragments. We introduce learning-based detection techniques where everything for representing terms +and fragments in source code is mined from the repository. +Our code analysis supports a framework, which relies on +deep learning, for automatically linking patterns mined at +the lexical level with patterns mined at the syntactic level. +We evaluated our novel learning-based approach for code +clone detection with respect to feasibility from the point +of view of software maintainers. We sampled and manually +evaluated 398 file- and 480 method-level pairs across eight +real-world Java systems; 93% of the file- and method-level +samples were evaluated to be true positives. Among the true +positives, we found pairs mapping to all four clone types. We +compared our approach to a traditional structure-oriented +technique and found that our learning-based approach detected clones that were either undetected or suboptimally +reported by the prominent tool Deckard. Our results affirm +that our learning-based approach is suitable for clone detection and a tenable technique for researchers. +	clone
2015	Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation + + + + +	Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura	ASE	Pseudo-code written in natural language can aid +the comprehension of source code in unfamiliar programming +languages. However, the great majority of source code has no +corresponding pseudo-code, because pseudo-code is redundant +and laborious to create. If pseudo-code could be generated +automatically and instantly from given source code, we could +allow for on-demand production of pseudo-code without human +effort. In this paper, we propose a method to automatically +generate pseudo-code from source code, specifically adopting the +statistical machine translation (SMT) framework. SMT, which +was originally designed to translate between two natural languages, allows us to automatically learn the relationship between +source code/pseudo-code pairs, making it possible to create a +pseudo-code generator with less human effort. In experiments, +we generated English or Japanese pseudo-code from Python +statements using SMT, and find that the generated pseudo-code +is largely accurate, and aids code understanding. +	representation bimodal grammar
2015	Suggesting Accurate Method and Class Names + + + + +	Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton	FSE	Descriptive names are a vital part of readable, and hence maintainable, code. Recent progress on automatically suggesting names for local variables tantalizes with the prospect of replicating that success with method and class names. However, suggesting names for methods and classes is much more difficult. This is because good method and class names need to be functionally descriptive, but suggesting such names requires that the model goes beyond local context. We introduce a neural probabilistic language model for source code that is specifically designed for the method naming problem. Our model learns which names are semantically similar by assigning them to locations, called embeddings, in a high-dimensional continuous space, in such a way that names with similar embeddings tend to be used in similar contexts. These embeddings seem to contain semantic information about tokens, even though they are learned only from statistical co-occurrences of tokens. Furthermore, we introduce a variant of our model +that is, to our knowledge, the first that can propose neologisms, names that have not appeared in the training corpus. We obtain state of the art results on the method, class, and even the simpler variable naming tasks. More broadly, the continuous embeddings that are learned by our model have the potential for wide application within software engineering. + +	naming
2015	A Bimodal Modelling of Source Code and Natural Language + + + + +	Miltiadis Allamanis, Daniel Tarlow, Andrew Gordon, Yi Wei	ICML	We consider the problem of building probabilistic models that jointly +model short natural language utterances and source code snippets. The +aim is to bring together recent work on statistical modelling of source +code and work on bimodal models of images and natural language. The +resulting models are useful for a variety of tasks that involve natural +language and source code. We demonstrate their performance on two +retrieval tasks: retrieving source code snippets given a natural language +query, and retrieving natural language descriptions given a source code +query (i.e., source code captioning). Experiments show there to be +promise in this direction, and that modelling the structure of source +code improves performance. +	search grammar grammar bimodal
2015	Learning Program Embeddings to Propagate Feedback on Student Code + + + + +	Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, Leonidas Guibas	ICML	Providing feedback, both assessing final work +and giving hints to stuck students, is difficult +for open-ended assignments in massive online +classes which can range from thousands to millions of students. We introduce a neural network +method to encode programs as a linear mapping +from an embedded precondition space to an embedded postcondition space and propose an algorithm for feedback at scale using these linear maps as features. We apply our algorithm +to assessments from the Code.org Hour of Code +and Stanford University’s CS1 course, where we +propagate human comments on student assignments to orders of magnitude more submissions. +	representation repair education
2015	Toward Deep Learning Software Repositories + + + + +	Martin White, Christopher Vendome, Mario Linares-Vasquez, Denys Poshyvanyk	MSR	Deep learning subsumes algorithms that automatically learn compositional representations. The ability of these +models to generalize well has ushered in tremendous advances +in many fields such as natural language processing (NLP). +Recent research in the software engineering (SE) community +has demonstrated the usefulness of applying NLP techniques to +software corpora. Hence, we motivate deep learning for software +language modeling, highlighting fundamental differences between +state-of-the-practice software language models and connectionist +models. Our deep learning models are applicable to source +code files (since they only require lexically analyzed source +code written in any programming language) and other types +of artifacts. We show how a particular deep learning model +can remember its state to effectively model sequential data, +e.g., streaming software tokens, and the state is shown to be +much more expressive than discrete tokens in a prefix. Then we +instantiate deep learning models and show that deep learning +induces high-quality models compared to n-grams and cache-based n-grams on a corpus of Java projects. We experiment +with two of the models’ hyperparameters, which govern their +capacity and the amount of context they use to inform predictions, +before building several committees of software language models +to aid generalization. Then we apply the deep learning models to +code suggestion and demonstrate their effectiveness at a real SE +task compared to state-of-the-practice models. Finally, we propose +avenues for future work, where deep learning can be brought to +bear to support model-based testing, improve software lexicons, +and conceptualize software artifacts. Thus, our work serves as +the first step toward deep learning software repositories. +	representation
2015	CACHECA: A Cache Language Model Based Code Suggestion Tool + + + + +	Christine Franks, Zhaopeng Tu, Premkumar Devanbu, Vincent Hellendoorn	ICSE	Nearly every Integrated Development Environment includes a form of code completion. The suggested completions (“suggestions”) are typically based on information available at compile time, such as type signatures and variables in scope. A statistical approach, based on estimated models of code patterns in large code corpora, has been demonstrated to be effective at predicting tokens given a context. In this demo, we present CACHECA, an Eclipse plugin that combines the native suggestions with a statistical suggestion regime. We demonstrate that a combination of the two approaches more than doubles Eclipse’s suggestion accuracy. A video demonstration is available at https://www.youtube.com/watch?v=3INk0N3JNtc. +	language model
2015	Exploring the Use of Deep Learning for Feature Location + + + + +	Christopher S. Corley, Kostadin Damevski, Nicholas A. Kraft		Deep learning models are a class of neural networks. Relative to n-gram models, deep learning models can capture more complex statistical patterns based on smaller training corpora. In this paper we explore the use of a particular deep learning model, document vectors (DVs), for feature location. DVs seem well suited to use with source code, because they both capture the influence of context on each term in a corpus and map terms into a continuous semantic space that encodes semantic relationships such as synonymy. We present preliminary results that show that a feature location technique (FLT) based on DVs can outperform an analogous FLT based on latent Dirichlet allocation (LDA) and then suggest several directions for future work on the use of deep learning models to improve developer effectiveness in feature location. +	feature location representation
2015	KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts + + + + +	Dana Movshovitz-Attias, William W. Cohen	ACL	Many existing knowledge bases (KBs), including Freebase, Yago, and NELL, rely +on a fixed ontology, given as an input +to the system, which defines the data to +be cataloged in the KB, i.e., a hierarchy of categories and relations between +them. The system then extracts facts that +match the predefined ontology. We propose an unsupervised model that jointly +learns a latent ontological structure of an +input corpus, and identifies facts from the +corpus that match the learned structure. +Our approach combines mixed membership stochastic block models and topic +models to infer a structure by jointly modeling text, a latent concept hierarchy, and +latent semantic relationships among the +entities mentioned in the text. As a case +study, we apply the model to a corpus +of Web documents from the software domain, and evaluate the accuracy of the various components of the learned ontology. +	pattern mining
2015	Graph-based Statistical Language Model for Code + + + + +	Anh Tuan Nguyen, Tien N. Nguyen	ICSE	n-gram statistical language model has been successfully applied to capture programming patterns to support code +completion and suggestion. However, the approaches using n-gram face challenges in capturing the patterns at higher levels +of abstraction due to the mismatch between the sequence nature +in n-grams and the structure nature of syntax and semantics +in source code. This paper presents GraLan, a graph-based +statistical language model and its application in code suggestion. GraLan can learn from a source code corpus and compute +the appearance probabilities of any graphs given the observed +(sub)graphs. We use GraLan to develop an API suggestion +engine and an AST-based language model, ASTLan. ASTLan +supports the suggestion of the next valid syntactic template +and the detection of common syntactic templates. Our empirical +evaluation on a large corpus of open-source projects has shown +that our engine is more accurate in API code suggestion than +the state-of-the-art approaches, and in 75% of the cases, it can +correctly suggest the API with only five candidates. ASTLan also +has high accuracy in suggesting the next syntactic template and +is able to detect many useful and common syntactic templates. +	representation language model autocomplete
2015	A User-Guided Approach to Program Analysis + + + + +	Ravi Mangal, Xin Zhang, Aditya V. Nori, Mayur Naik	FSE	Program analysis tools often produce undesirable output +due to various approximations. We present an approach +and a system Eugene that allows user feedback to guide +such approximations towards producing the desired output. +We formulate the problem of user-guided program analysis in terms of solving a combination of hard rules and soft +rules: hard rules capture soundness while soft rules capture +degrees of approximations and preferences of users. Our +technique solves the rules using an off-the-shelf solver in a +manner that is sound (satisfies all hard rules), optimal (maximally satisfies soft rules), and scales to real-world analy- +ses and programs. We evaluate Eugene on two different +analyses with labeled output on a suite of seven Java pro- +grams of size 131–198 KLOC. We also report upon a user +study involving nine users who employ Eugene to guide an +information-flow analysis on three Java micro-benchmarks. +In our experiments, Eugene significantly reduces misclassified reports upon providing limited amounts of feedback. +	program analysis
2015	On the “Naturalness” of Buggy Code + + + + +	Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, Premkumar Devanbu	ICSE	Real software, the kind working programmers produce by the kLOC +to solve real-world problems, tends to be “natural”, like speech or +natural language; it tends to be highly repetitive and predictable. +Researchers have captured this naturalness of software through statistical models and used them to good effect in suggestion engines, +porting tools, coding standards checkers, and idiom miners. This +suggests that code that appears improbable, or surprising, to a good +statistical language model is “unnatural” in some sense, and thus +possibly suspicious. In this paper, we investigate this hypothesis. We consider a large corpus of bug fix commits (ca. 8,296), +from 10 different Java projects, and we focus on its language statistics, evaluating the naturalness of buggy code and the corresponding fixes. We find that code with bugs tends to be more entropic +(i.e. unnatural), becoming less so as bugs are fixed. Focusing on +highly entropic lines is similar in cost-effectiveness to some well-known static bug finders (PMD, FindBugs) and ordering warnings +from these bug finders using an entropy measure improves the cost-effectiveness of inspecting code implicated in warnings. This suggests that entropy may be a valid language-independent and simple +way to complement the effectiveness of PMD or FindBugs, and +that search-based bug-fixing methods may benefit from using entropy both for fault-localization and searching for fixes. + +	defect
2015	Synthesizing Java expressions from free-form queries + + + + +	Tihomir Gvero, Viktor Kuncak	OOPSLA	We present a new code assistance tool for integrated development environments. Our system accepts as input free-form queries containing a mixture of English and Java, and produces Java code expressions that take the query into account and respect syntax, types, and scoping rules of Java, as well as statistical usage patterns. In contrast to solutions based on code search, the results returned by our tool need not directly correspond to any previously seen code fragment. As part of our system we have constructed a probabilistic context free grammar for Java constructs and library invocations, as well as an algorithm that uses a customized natural language processing tool chain to extract information from free-form text queries. We present the results on a number of examples showing that our technique (1) often produces the expected code fragments, (2) tolerates much of the flexibility of natural language, and (3) can repair incorrect Java expressions that use, for example, the wrong syntax or missing arguments. +	synthesis code generation bimodal
2015	Predicting Program Properties from “Big Code” + + + + +	Veselin Raychev, Martin Vechev, Andreas Krause	POPL	We present a new approach for predicting program properties from +massive codebases (aka “Big Code”). Our approach first learns a +probabilistic model from existing data and then uses this model to +predict properties of new, unseen programs. + + The key idea of our work is to transform the input program into +a representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic +graphical models such as conditional random fields (CRFs) in order +to perform joint prediction of program properties. + + As an example of our approach, we built a scalable prediction +engine called JSNICE 1 for solving two kinds of problems in the +context of JavaScript: predicting (syntactic) names of identifiers +and predicting (semantic) type annotations of variables. Experimentally, JSNICE predicts correct names for 63% of name identifiers and its type annotation predictions are correct in 81% of the +cases. In the first week since its release, JSN ICE was used by more +than 30,000 developers and in only few months has become a popular tool in the JavaScript developer community. + + By formulating the problem of inferring program properties as +structured prediction and showing how to perform both learning +and inference in this context, our work opens up new possibilities +for attacking a wide range of difficult problems in the context of +“Big Code” including invariant generation, de-compilation, synthesis and others. +	program analysis naming types deobfuscation
2015	Irish: A Hidden Markov Model to detect coded information islands in free text + + + + +	Luigi Cerulo, Michele Ceccarelli, Massimiliano Di Penta, Gerardo Canfora	Science of Computer Programming	Developers’ communication, as contained in emails, issue trackers, and forums, is a precious source of information to support the development process. For example, it can +be used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers’ communication can be useful to support +several software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging, due to the unstructured nature of free text, which mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language parts. + + We conduct an extensive evaluation of Irish (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, with respect to the state of art approaches based on island parsing or island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure of Irish between 74% and 99%; this is in line with existing approaches which, differently from Irish, require specific expertise for the definition of regular expressions or grammars. + +	information extraction
2015	Visualizing and Understanding Recurrent Networks + + + + +	Andrej Karpathy, Justin Johnson, Li Fei-Fei		Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful +applications in a wide range of machine learning problems that involve sequential +data. However, while LSTMs provide exceptional results in practice, the source +of their performance and their limitations remain rather poorly understood. Using character-level language models as an interpretable testbed, we aim to bridge +this gap by providing an analysis of their representations, predictions and error +types. In particular, our experiments reveal the existence of interpretable cells that +keep track of long-range dependencies such as line lengths, quotes and brackets. +Moreover, our comparative analysis with finite horizon n-gram models traces the +source of the LSTM improvements to long-range structural dependencies. Finally, +we provide analysis of the remaining errors and suggests areas for further study. + +	language model code generation
2015	Will they like this? Evaluating Code Contributions With Language Models + + + + +	Vincent J. Hellendoorn, Premkumar Devanbu, Alberto Bacchelli	MSR	Popular open-source software projects receive and +review contributions from a diverse array of developers, many +of whom have little to no prior involvement with the project. A +recent survey reported that reviewers consider conformance to +the project’s code style to be one of the top priorities when evaluating code contributions on Github. We propose to quantitatively +evaluate the existence and effects of this phenomenon. To this aim +we use language models, which were shown to accurately capture +stylistic aspects of code. We find that rejected changesets do +contain code significantly less similar to the project than accepted +ones; furthermore, the less similar changesets are more likely +to be subject to thorough review. Armed with these results we +further investigate whether new contributors learn to conform to +the project style and find that experience is positively correlated +with conformance to the project’s code style. +	review language model
2015	Using Machine Translation for Converting Python 2 to Python 3 Code + + + + +	Karan Aggarwal, Mohammad Salameh, Abram Hindle		In this paper, we have tried to use Statistical machine translation in order to convert Python 2 code to Python 3 code. We use data from two projects and achieve a high BLEU score. We also investigate the cross-project training and testing to analyze the errors so as to ascertain differences with previous case. We have described a pilot study on modeling programming languages as natural language to build translation models on the lines of natural languages. This can be further worked on to translate between versions of a programming language or cross-programming-languages code translation. +	migration
2015	Intelligent Code Completion with Bayesian Networks + + + + +	Sebastian Proksch, Johannes Lerch, Mira Mezini	TSE	Code completion is an integral part of modern Integrated Development Environments (IDEs). Developers +often use it to explore Application Programming Interfaces (APIs). It is also useful to reduce the required +amount of typing and to help avoid typos. Traditional code completion systems propose all type-correct +methods to the developer. Such a list is often very long with many irrelevant items. More intelligent code +completion systems have been proposed in prior work to reduce the list of proposed methods to relevant +items. + + This work extends one of these existing approaches, the Best Matching Neighbor (BMN) algorithm. We +introduce Bayesian networks as an alternative underlying model, use additional context information for +more precise recommendations, and apply clustering techniques to improve model sizes. We compare our +new approach, Pattern-based Bayesian Networks (PBN), to the existing BMN algorithm. We extend previously used evaluation methodologies and, in addition to prediction quality, we also evaluate model size and +inference speed. + + Our results show that the additional context information we collect improves prediction quality, especially +for queries that do not contain method calls. We also show that PBN can obtain comparable prediction +quality to BMN, while model size and inference speed scale better with large input sizes. +	autocomplete
2015	Learning a Strategy for Adapting a Program Analysis via Bayesian Optimisation + + + + +	Hakjoo Oh, Hongseok Yang, Kwangkeun Yi.	OOPSLA	Building a cost-effective static analyser for real-world programs is still regarded an art. One key contributor to this +grim reputation is the difficulty in balancing the cost and the +precision of an analyser. An ideal analyser should be adap- +tive to a given analysis task, and avoid using techniques that +unnecessarily improve precision and increase analysis cost. +However, achieving this ideal is highly nontrivial, and it requires a large amount of engineering efforts. + + In this paper we present a new approach for building +an adaptive static analyser. In our approach, the analyser +includes a sophisticated parameterised strategy that decides, for each part of a given program, whether to apply +a precision-improving technique to that part or not. We +present a method for learning a good parameter for such +a strategy from an existing codebase via Bayesian optimisation. The learnt strategy is then used for new, unseen programs. Using our approach, we developed partially flow- +and context-sensitive variants of a realistic C static analyser. +The experimental results demonstrate that using Bayesian +optimisation is crucial for learning from an existing codebase. Also, they show that among all program queries that +require flow- or context-sensitivity, our partially flow- and +context-sensitive analysis answers the 75% of them, while +increasing the analysis cost only by 3.3x of the baseline +flow- and context-insensitive analysis, rather than 40x or +more of the fully sensitive version. +	program analysis
2015	Products, Developers, and Milestones: How Should I Build My N-Gram Language Model + + + + +	Juliana Saraiva, Christian Bird, Thomas Zimmermann	FSE	Recent work has shown that although programming languages en- +able source code to be rich and complex, most code tends to be +repetitive and predictable. The use of natural language processing +(NLP) techniques applied to source code such as n-gram language +models show great promise in areas such as code completion, aiding impaired developers, and code search. In this paper, we address +three questions related to different methods of constructing lan- +guage models in an industrial context. Specifically, we ask: (1) Do +application specific, but smaller language models perform better +than language models across applications? (2) Are developer specific language models effective and do they differ depending on +what parts of the codebase a developer is working in? (3) Finally, +do language models change over time, i.e., does a language model +from early development model change later on in development? +The answers to these questions enable techniques that make use of +programming language models in development to choose the model +training corpus more effectively. + + We evaluate these questions by building 28 language models across +developers, time periods, and applications within Microsoft Office +and present the results in this paper. We find that developer and +application specific language models perform better than models +from the entire codebase, but that temporality has little to no effect +on language model performance. +	language model
2015	Aroma: code recommendation via structural code search + + + + +	Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, Satish Chandra	PACMPL	Programmers often write code that has similarity to existing code written somewhere. A tool that could help programmers to search such similar code would be immensely useful. Such a tool could help programmers to extend partially written code snippets to completely implement necessary functionality, help to discover extensions to the partial code which are commonly included by other programmers, help to cross-check against similar code written by other programmers, or help to add extra code which would fix common mistakes and errors. We propose Aroma, a tool and technique for code recommendation via structural code search. Aroma indexes a huge code corpus including thousands of open-source projects, takes a partial code snippet as input, searches the corpus for method bodies containing the partial code snippet, and clusters and intersects the results of the search to recommend a small set of succinct code snippets which both contain the query snippet and appear as part of several methods in the corpus. We evaluated Aroma on 2000 randomly selected queries created from the corpus, as well as 64 queries derived from code snippets obtained from Stack Overflow, a popular website for discussing code. We implemented Aroma for 4 different languages, and developed an IDE plugin for Aroma. Furthermore, we conducted a study where we asked 12 programmers to complete programming tasks using Aroma, and collected their feedback. Our results indicate that Aroma is capable of retrieving and recommending relevant code snippets efficiently. +	search
2015	OverCode: visualizing variation in student solutions to programming problems at scale + + + + +	Elena L. Glassman, Jeremy Scott, Rishabh Singh, Philip J. Guo, Robert C. Miller		In MOOCs, a single programming exercise may produce thousands of solutions from learners. Understanding solution variation is important for providing appropriate feedback to students at scale. The wide variation among these solutions can be a source of pedagogically valuable examples and can be used to refine the autograder for the exercise by exposing corner cases. We present OverCode, a system for visualizing and exploring thousands of programming solutions. OverCode uses both static and dynamic analysis to cluster similar solutions, and lets teachers further filter and cluster solutions based on different criteria. We evaluated OverCode against a nonclustering baseline in a within-subjects study with 24 teaching assistants and found that the OverCode interface allows teachers to more quickly develop a high-level view of students’ understanding and misconceptions, and to provide feedback that is relevant to more students’ solutions. +	repair
2015	NIRMAL: Automatic Identification of Software Relevant Tweets Leveraging Language Model + + + + +	Abhishek Sharma, Yuan Tian, David Lo	SANER	Twitter is one of the most widely used social media +platforms today. It enables users to share and view short 140-character messages called “tweets”. About 284 million active +users generate close to 500 million tweets per day. Such rapid +generation of user generated content in large magnitudes results +in the problem of information overload. Users who are interested +in information related to a particular domain have limited means +to filter out irrelevant tweets and tend to get lost in the huge +amount of data they encounter. A recent study by Singer et +al. found that software developers use Twitter to stay aware of +industry trends, to learn from others, and to network with other +developers. However, Singer et al. also reported that developers +often find Twitter streams to contain too much noise which is a +barrier to the adoption of Twitter. In this paper, to help developers +cope with noise, we propose a novel approach named NIRMAL, +which automatically identifies software relevant tweets from a +collection or stream of tweets. Our approach is based on language +modeling which learns a statistical model based on a training +corpus (i.e., set of documents). We make use of a subset of posts +from StackOverflow, a programming question and answer site, as +a training corpus to learn a language model. A corpus of tweets +was then used to test the effectiveness of the trained language +model. The tweets were sorted based on the rank the model +assigned to each of the individual tweets. The top 200 tweets +were then manually analyzed to verify whether they are software +related or not, and then an accuracy score was calculated. The +results show that decent accuracy scores can be achieved by +various variants of NIRMAL, which indicates that NIRMAL can +effectively identify software related tweets from a huge corpus of +tweets. +	information extraction
2014	Learning to Execute + + + + +	Wojciech Zaremba, Ilya Sutskever		Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks. We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks’ performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an LSTM to add two 9-digit numbers with 99% accuracy. +	execution representation
2014	Learning Natural Coding Conventions + + + + +	Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton	FSE	Every programmer has a characteristic style, ranging from preferences +about identifier naming to preferences about object relationships and +design patterns. Coding conventions define a consistent syntactic style, +fostering readability and hence maintainability. When collaborating, +programmers strive to obey a project’s coding conventions. However, +one third of reviews of changes contain feedback about coding conventions, +indicating that programmers do not always follow them and that project +members care deeply about adherence. Unfortunately, programmers are +often unaware of coding conventions because inferring them requires a +global view, one that aggregates the many local decisions programmers +make and identifies emergent consensus on style. We present Naturalize, +a framework that learns the style of a codebase, and suggests revisions +to improve stylistic consistency. Naturalize builds on recent work in +applying statistical natural language processing to source code. We +apply Naturalize to suggest natural identifier names and formatting +conventions. We present four tools focused on ensuring natural code +during development and release management, including code review. +Naturalize achieves 94% accuracy in its top suggestions for identifier +names. We used Naturalize to generate 18 patches for 5 open source +projects: 14 were accepted. +	naming language model style
2014	Phrase-Based Statistical Translation of Programming Languages + + + + +	S. Karaivanov, Veselin Raychev, Martin Vechev	Onward	Phrase-based statistical machine translation approaches have been +highly successful in translating between natural languages and are +heavily used by commercial systems (e.g. Google Translate). + + The main objective of this work is to investigate the applicability of +these approaches for translating between programming languages. +Towards that, we investigated several variants of the phrase-based +translation approach: i) a direct application of the approach to +programming languages, ii) a novel modification of the approach +to incorporate the grammatical structure of the target programming +language (so to avoid generating target programs which do not +parse), and iii) a combination of ii) with custom rules added to +improve the quality of the translation. + + To experiment with the above systems, we investigated machine +translation from C# to Java. For the training, which takes about +60 hours, we used a parallel corpus of 20, 499 C#-to-Java method +translations. We then evaluated each of the three systems above by +translating 1,000 C# methods. Our experimental results indicate +that with the most advanced system, about 60% of the translated +methods compile (the top ranked) and out of a random sample of 50 +correctly compiled methods, 68% (34 methods) were semantically +equivalent to the reference solution. +	migration code generation
2014	Structured Generative Models of Natural Source Code + + + + +	Chris J. Maddison, Daniel Tarlow	ICML	We study the problem of building generative +models of natural source code (NSC); that is, +source code written by humans and meant to +be understood by humans. Our primary con- +tribution is to describe new generative models +that are tailored to NSC. The models are based +on probabilistic context free grammars (PCFGs) +and neuro-probabilistic language models (Mnih +& Teh, 2012), which are extended to incorporate +additional source code-specific structure. These +models can be efficiently trained on a corpus +of source code and outperform a variety of less +structured baselines in terms of predictive log +likelihoods on held-out data. + +	language model code generation grammar grammar
2014	A system to grade computer programming skills using machine learning + + + + +	Shashank Srikant, Varun Aggarwal	KDD	The automatic evaluation of computer programs is a nascent area of research with a potential for large-scale impact. Extant program assessment systems score mostly based on the number of test-cases passed, providing no insight into the competency of the programmer. In this paper, we present a system to grade computer programs automatically. In addition to grading a program on its programming practices and complexity, the key kernel of the system is a machine-learning based algorithm which determines closeness of the logic of the given program to a correct program. This algorithm uses a set of highly-informative features, derived from the abstract representations of a given program, that capture the program’s functionality. These features are then used to learn a model to grade the programs, which are built against evaluations done by experts. We show that the regression models provide much better grading than the ubiquitous test-case-pass based grading and rivals the grading accuracy of other open-response problems such as essay grading . We also show that our novel features add significant value over and above basic keyword/expression count features. In addition to this, we propose a novel way of posing computer-program grading as a one-class modeling problem and report encouraging preliminary results. We show the value of the system through a case study in a real-world industrial deployment. To the best of the authors’ knowledge, this is the first time a system using machine learning has been developed and used for grading programs. The work is timely with regard to the recent boom in Massively Online Open Courseware (MOOCs), which promises to produce a significant amount of hand-graded digitized data. +	education
2014	Mining Idioms from Source Code + + + + +	Miltiadis Allamanis, Charles Sutton	FSE	We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic purpose. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present Haggis, a system for mining code idioms that builds on recent advanced techniques from statistical natural language processing, namely, nonparametric Bayesian probabilistic tree substitution grammars. We apply Haggis to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicate that they describe important program concepts, including object creation, exception handling, and resource management. +	pattern mining grammar grammar
2014	Syntax Errors Just Aren’t Natural: Improving Error Reporting with Language Models + + + + +	Joshua Charles Campbell, Abram Hindle, José Nelson Amaral	MSR	A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in +many errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance on mutated source code from several projects. Our tool, trained on the past versions of a project, can effectively augment the syntax error locations produced by the native compiler. Thus we provide a methodology and tool that exploits the naturalness of software source code to detect syntax errors alongside the parser. +	repair language model
2014	Building Program Vector Representations for Deep Learning + + + + +	Hao Peng, Lili Mou, Ge Li, Yuxuan Liu, Lu Zhang, Zhi Jin.	International Conference on Knowledge Science, Engineering and Management	Deep learning has made significant breakthroughs +in various fields of artificial intelligence. Advantages of deep +learning include the ability to capture highly complicated features, weak involvement of human engineering, etc. However, +it is still virtually impossible to use deep learning to analyze +programs since deep architectures cannot be trained effectively +with pure back propagation. In this pioneering paper, we propose +the “coding criterion” to build program vector representations, +which are the premise of deep learning for program analysis. Our +representation learning approach directly makes deep learning a +reality in this new field. We evaluate the learned vector representations both qualitatively and quantitatively. We conclude, based +on the experiments, the coding criterion is successful in building +program representations. To evaluate whether deep learning +is beneficial for program analysis, we feed the representations +to deep neural networks, and achieve higher accuracy in the +program classification task than “shallow” methods, such as +logistic regression and the support vector machine. This result +confirms the feasibility of deep learning to analyze programs. It +also gives primary evidence of its success in this new field. We +believe deep learning will become an outstanding technique for +program analysis in the near future. + +	representation grammar
2014	Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code + + + + +	Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen	ASE	Prior research shows that directly applying phrase-based SMT on lexical tokens to migrate Java to C# produces +much semantically incorrect code. A key limitation is the use of +sequences in phrase-based SMT to model and translate source +code with well-formed structures. We propose mppSMT, a divideand-conquer technique to address that with novel training and migration algorithms using phrase-based SMT in three phases. First, +mppSMT treats a program as a sequence of syntactic units and +maps/translates such sequences in two languages to one another. +Second, with a syntax-directed fashion, it deals with the tokens +within syntactic units by encoding them with semantic symbols to +represent their data and token types. This encoding via semantic +symbols helps better migration of API usages. Third, the lexical +tokens corresponding to each sememe are mapped or migrated. +The resulting sequences of tokens are merged together to form +the final migrated code. Such divide-and-conquer and syntax-direction strategies enable phrase-based SMT to adapt well to +syntactical structures in source code, thus, improving migration +accuracy. Our empirical evaluation on several real-world systems +shows that 84.8–97.9% and 70–83% of the migrated methods are +syntactically and semantically correct, respectively. 26.3–51.2% +of total migrated methods are exactly matched to the human-written C# code in the oracle. Compared to Java2CSharp, a rule-based migration tool, it achieves higher semantic accuracy from +6.6–57.7% relatively. Importantly, it does not require manual +labeling for training data or manual definition of rules. +	migration
2014	On the Localness of Software + + + + +	Zhaopeng Tu, Zhendong Su, Premkumar Devanbu	FSE	The n-gram language model, which has its roots in statistical natural +language processing, has been shown to successfully capture the +repetitive and predictable regularities (“naturalness”) of source code, +and help with tasks such as code suggestion, porting, and designing +assistive coding devices. However, we show in this paper that this +natural-language-based model fails to exploit a special property of +source code: localness. We find that human-written programs are +localized: they have useful local regularities that can be captured +and exploited. We introduce a novel cache language model that +consists of both an n-gram and an added “cache” component to +exploit localness. We show empirically that the additional cache +component greatly improves the n-gram approach by capturing +the localness of software, as measured by both cross-entropy and +suggestion accuracy. Our model’s suggestion accuracy is actually +comparable to a state-of-the-art, semantically augmented language +model; but it is simpler and easier to implement. Our cache language +model requires nothing beyond lexicalization, and thus is applicable +to all programming languages. +	language model
2014	Using Web Corpus Statistics for Program Analysis + + + + +	Chun-Hung Hsiao, Michael Cafarella, Satish Narayanasamy	OOPSLA	Several program analysis tools—such as plagiarism detection and bug finding—rely on knowing a piece of code’s +relative semantic importance. For example, a plagiarism detector should not bother reporting two programs that have +an identical simple loop counter test, but should report programs that share more distinctive code. Traditional program +analysis techniques (e.g., finding data and control dependencies) are useful, but do not say how surprising or common +a line of code is. Natural language processing researchers +have encountered a similar problem and addressed it using +an n-gram model of text frequency, derived from statistics +computed over text corpora. + + We propose and compute an n-gram model for programming languages, computed over a corpus of 2.8 million +JavaScript programs we downloaded from the Web. In contrast to previous techniques, we describe a code n-gram as +a subgraph of the program dependence graph that contains +all nodes and edges reachable in n steps from the statement. +We can count n-grams in a program and count the frequency +of n-grams in the corpus, enabling us to compute tf-idf-style +measures that capture the differing importance of different +lines of code. We demonstrate the power of this approach by +implementing a plagiarism detector with accuracy that beats +previous techniques, and a bug-finding tool that discovered +over a dozen previously unknown bugs in a collection of real +deployed programs. +	defect
2014	Code Completion with Statistical Language Models + + + + +	Veselin Raychev, Martin Vechev, Eran Yahav	PLDI	We address the problem of synthesizing code completions for programs using APIs. Given a program with holes, we synthesize completions for holes with the most likely sequences of method calls. + + Our main idea is to reduce the problem of code completion to +a natural-language processing problem of predicting probabilities +of sentences. We design a simple and scalable static analysis that +extracts sequences of method calls from a large codebase, and +index these into a statistical language model. We then employ +the language model to find the highest ranked sentences, and use +them to synthesize a code completion. Our approach is able to +synthesize sequences of calls across multiple objects together with +their arguments. + + Experiments show that our approach is fast and effective. Virtually all computed completions typecheck, and the desired completion appears in the top 3 results in 90% of the cases. +	language model autocomplete code generation
2014	NLyze: Interactive Programming by Natural Language for SpreadSheet Data Analysis and Manipulation + + + + +	Sumit Gulwani, Mark Marron	SIGMOD	Millions of computer end users need to perform tasks over tabular spreadsheet data, yet lack the programming knowledge to do such tasks automatically. This paper describes +the design and implementation of a robust natural language +based interface to spreadsheet programming. Our methodology involves designing a typed domain-specific language +(DSL) that supports an expressive algebra of map, filter, reduce, join, and formatting capabilities at a level of abstraction appropriate for non-expert users. The key algorithmic +component of our methodology is a translation algorithm +for converting a natural language specification in the context of a given spreadsheet to a ranked set of likely programs +in the DSL. The translation algorithm leverages the spreadsheet spatial and temporal context to assign interpretations +to specifications with implicit references, and is thus robust +to a variety of ways in which end users can express the same +task. The translation algorithm builds over ideas from keyword programming and semantic parsing to achieve both +high precision and high recall. We implemented the system +as an Excel add-in called NLyze that supports a rich user +interaction model including annotating the user’s natural +language specification and explaining the synthesized DSL +programs by paraphrasing them into structured English. We +collected a total of 3570 English descriptions for 40 spreadsheet tasks and our system was able to generate the intended +interpretation as the top candidate for 94% (97% for the top +3) of those instances. + +	code generation bimodal synthesis
2014	Statistical Learning Approach for Mining API Usage Mappings for Code Migration + + + + +	Anh Tuan Nguyen, Hoan Anh Nguyen, Tung Thanh Nguyen, Tien N. Nguyen	ASE	The same software product nowadays could appear in multiple platforms and devices. To address business needs, software companies +develop a software product in a programming language and then +migrate it to another one. To support that process, semi-automatic +migration tools have been proposed. However, they require users +to manually define the mappings between the respective APIs of +the libraries used in two languages. To reduce such manual effort, +we introduce StaMiner, a novel data-driven approach that statistically learns the mappings between APIs from the corpus of the +corresponding client code of the APIs in two languages Java and +C#. Instead of using heuristics on the textual or structural similarity +between APIs in two languages to map API methods and classes +as in existing mining approaches, StaMiner is based on a statistical +model that learns the mappings in such a corpus and provides mappings for APIs with all possible arities. Our empirical evaluation +on several projects shows that StaMiner can detect API usage mappings with higher accuracy than a state-of-the-art approach. With +the resulting API mappings mined by StaMiner, Java2CSharp, an +existing migration tool, could achieve a higher level of accuracy. +	migration API
2013	A Statistical Semantic Language Model for Source Code + + + + +	Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, Tien N. Nguyen	FSE	Recent research has successfully applied the statistical n-gram language model to show that source code exhibits a +good level of repetition. The n-gram model is shown to have +good predictability in supporting code suggestion and completion. However, the state-of-the-art n-gram approach to +capture source code regularities/patterns is based only on +the lexical information in a local context of the code units. +To improve predictability, we introduce SLAMC, a novel statistical semantic language model for source code. It incorporates semantic information into code tokens and models the +regularities/patterns of such semantic annotations, called sememes, rather than their lexemes. It combines the local context in semantic n-grams with the global technical concerns/functionality into an n-gram topic model, together with pairwise associations of program elements. Based on SLAMC, +we developed a new code suggestion method, which is empirically evaluated on several projects to have relatively 18–68% +higher accuracy than the state-of-the-art approach. + +	language model
2013	Structured Statistical Syntax Tree Prediction + + + + +	Cyrus Omar	SPLASH	Statistical models of source code can be used to improve +code completion systems, assistive interfaces, and code +compression engines. We are developing a statistical model +where programs are represented as syntax trees, rather than +simply a stream of tokens. Our model, initially for the Java +language, combines corpus data with information about syntax, types and the program context. We tested this model +using open source code corpuses and find that our model +is significantly more accurate than the current state of the +art, providing initial evidence for our claim that combining +structural and statistical information is a fruitful strategy. +	language model grammar
2013	Lexical Statistical Machine Translation for Language Migration + + + + +	Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen	FSE	Prior research has shown that source code also exhibits naturalness, i.e. it is written by humans and is likely to be +repetitive. The researchers also showed that the n-gram language model is useful in predicting the next token in a source +file given a large corpus of existing source code. In this paper, we investigate how well statistical machine translation +(SMT) models for natural languages could help in migrating source code from one programming language to another. +We treat source code as a sequence of lexical tokens and +apply a phrase-based SMT model on the lexemes of those +tokens. Our empirical evaluation on migrating two Java +projects into C# showed that lexical, phrase-based SMT +could achieve high lexical translation accuracy ( BLEU from +81.3-82.6%). Users would have to manually edit only 11.9-15.8% of the total number of tokens in the resulting code to +correct it. However, a high percentage of total translation +methods (49.5-58.6%) is syntactically incorrect. Therefore, +our result calls for a more program-oriented SMT model that +is capable of better integrating the syntactic and semantic +information of a program to support language migration. +	migration API
2013	Using Semantic Unification to Generate Regular Expressions from Natural Language + + + + +	Nate Kushman, Regina Barzilay	NAACL	We consider the problem of translating natural language text queries into regular expressions which represent their meaning. The mismatch in the level of abstraction between the natural language representation and the regular expression representation make this a novel and challenging problem. However, a given regular expression can be written in many semantically equivalent forms, and we exploit this flexibility to facilitate translation by finding a form which more directly corresponds to the natural language. We evaluate our technique on a set of natural language queries and their associated regular expressions which we gathered from Amazon Mechanical Turk. Our model substantially outperforms a state-of-the-art semantic parsing baseline, yielding a 29% absolute improvement in accuracy. +	bimodal code generation
2013	A Machine Learning Framework for Programming by Example + + + + +	Aditya Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, Adam Kalai	ICML	Learning programs is a timely and interesting challenge. In Programming by Example +(PBE), a system attempts to infer a program +from input and output examples alone, by +searching for a composition of some set of +base functions. We show how machine learning can be used to speed up this seemingly +hopeless search problem, by learning weights +that relate textual features describing the +provided input-output examples to plausible +sub-components of a program. This generic +learning framework lets us address problems +beyond the scope of earlier PBE systems. +Experiments on a prototype implementation +show that learning improves search and ranking on a variety of text processing tasks found +on help forums. +	code generation
2013	A Hidden Markov Model to Detect Coded Information Islands in Free Text + + + + +	Luigi Cerulo, Michele Ceccarelli, Massimiliano Di Penta, Gerardo Canfora	SCAM	Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such a content is challenging, due to the mix-up of +source code and natural language, unstructured text. + + In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train a HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens — e.g., words, language keywords, numbers, parentheses, punctuation marks, etc. — observed in a text switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could be easily extended to include in principle any text-interleaved language. + + We evaluated our approach with respect to the state of art on a set of development emails and bug reports drawn from the software repositories of well known open source systems. Results indicate an accuracy between 82% and 99%, which is in line with existing approaches which, differently from ours, require the manual definition of regular expressions or parsers. + +	information extraction
2013	Natural Language Models for Predicting Programming Comments + + + + +	Dana Movshovitz-Attias, William W. Cohen	ACL	Statistical language models have successfully been used to describe and analyze +natural language documents. Recent work +applying language models to programming languages is focused on the task +of predicting code, while mainly ignoring +the prediction of programmer comments. +In this work, we predict comments from +JAVA source files of open source projects, +using topic models and n-grams, and we +analyze the performance of the models +given varying amounts of background data +on the project being predicted. We evaluate models on their comment-completion +capability in a setting similar to code completion tools built into standard code +editors, and show that using a comment +completion tool can save up to 47% of the +comment typing. + +	bimodal documentation summarization
2013	A Study of Repetitiveness of Code Changes in Software Evolution + + + + +	Hoan Anh Nguyen, Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen, and Hridesh Rajan	ASE	In this paper, we present a large-scale study of +repetitiveness of code changes in software evolution. We collected +a large data set of 2,841 Java projects, with 1.7 billion source lines +of code (SLOC) at the latest revisions, 1.8 million code change +revisions (0.4 million fixes), 6.2 million changed files, and 2.5 +billion changed SLOCs. A change is considered repeated within +or cross-project if it matches another change having occurred +in the history of the project or another project, respectively. We +report the following important findings. First, repetitiveness of +changes could be as high as 70–100% at small sizes and decreases +exponentially as size increases. Second, repetitiveness is higher +and more stable in the cross-project setting than in the project-within one. Third, fixing changes repeat similarly to general +changes. Importantly, learning code changes and recommending +them in software evolution is beneficial with accuracy for top-1 +recommendation of over 30% and top-3 of nearly 35%. Repeated +fixing changes could also be useful for automatic program repair. + +	edit
2013	Mining Source Code Repositories at Massive Scale Using Language Modeling + + + + +	Miltiadis Allamanis, Charles Sutton	MSR	The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. The giga-token model is significantly better at the code suggestion task than previous models. More broadly, our approach provides a new “lens” for analyzing software projects, enabling new complexity metrics based on statistical analysis of large corpora. We call these metrics data-driven complexity metrics. We propose new metrics that measure the complexity of a code module and the topical centrality of a module to a software project. In particular, it is possible to distinguish reusable utility classes from classes that are part of a program’s core logic based solely on general information theoretic criteria. +	language model
2012	On the Naturalness of Software + + + + +	Abram Hindle, Earl T. Barr, Mark Gabel, Zhendong Su, Premkumar Devanbu	ICSE	Natural languages like English are rich, complex, +and powerful. The highly creative and graceful use of languages +like English and Tamil, by masters like Shakespeare and +Avvaiyar, can certainly delight and inspire. But in practice, +given cognitive constraints and the exigencies of daily life, most +human utterances are far simpler and much more repetitive +and predictable. In fact, these utterances can be very usefully +modeled using modern statistical methods. This fact has led +to the phenomenal success of statistical approaches to speech +recognition, natural language translation, question-answering, +and text mining and comprehension. + + We begin with the conjecture that most software is also +natural, in the sense that it is created by humans at work, +with all the attendant constraints and limitations—and thus, +like natural language, it is also likely to be repetitive and +predictable. We then proceed to ask whether a) code can +be usefully modeled by statistical language models and b) +such models can be leveraged to support software engineers. +Using the widely adopted n-gram model, we provide empirical +evidence supportive of a positive answer to both these questions. +We show that code is also very repetitive, and in fact even more +so than natural languages. As an example use of the model, +we have developed a simple code completion engine for Java +that, despite its simplicity, already improves Eclipse’s built-in +completion capability. We conclude the paper by laying out a +vision for future research in this area. + +	language model autocomplete
2009	Learning from Examples to Improve Code Completion Systems + + + + +	Marcel Bruch, Martin Monperrus, Mira Mezini.	ESEC/FSE	The suggestions made by current IDE’s code completion features are based exclusively on static type system of the programming language. As a result, often proposals are made which are irrelevant for a particular working context. Also, these suggestions are ordered alphabetically rather than by their relevance in a particular context. In this paper, we present intelligent code completion systems that learn from existing code repositories. We have implemented three such systems, each using the information contained in +repositories in a different way. We perform a large-scale quantitative evaluation of these systems, integrate the best performing one into Eclipse, and evaluate the latter also by a user study. Our experiments give evidence that intelligent code completion systems which learn from examples significantly outperform mainstream code completion systems in terms of the relevance of their suggestions and thus have the potential to enhance developers’ productivity. +	autocomplete
{{ publication.year }}	{{publication.title}} +	2007	A Factor Graph Model for Software Bug Finding - - + +	{{ publication.authors }}	{{ publication.conference }}	{{ publication.content }}	{% for tag in publication.tags %}{{ tag }} {% endfor %}	Ted Kremenek, Andrew Y. Ng, Dawson R. Engler.	IJCAI	Automatic tools for finding software errors require +knowledge of the rules a program must obey, or +“specifications,” before they can identify bugs. We +present a method that combines factor graphs and +static program analysis to automatically infer specifications directly from programs. We illustrate the +approach on inferring functions in C programs that +allocate and release resources, and evaluate the approach on three codebases: SDL, OpenSSH, and +the OS kernel for Mac OS X (XNU). The inferred +specifications are highly accurate and with them we +have discovered numerous bugs. + +	program analysis

+ +

+

Graph4Code: A Machine Interpretable Knowledge Graph for Code

+

Ibrahim Abdelaziz, Julian Dolby, James P. McCusker, Kavitha Srinivas. 2020

+

+ + [ArXiV] + + [Website] + + + +
+ + dataset + +

+

Knowledge graphs have proven extremely useful in powering diverse applications in semantic search and natural language understanding. Graph4Code is a knowledge graph about program code that can similarly power diverse applications such as program search, code understanding, refactoring, bug detection, and code automation. The graph uses generic techniques to capture the semantics of Python code: the key nodes in the graph are classes, functions and methods in popular Python modules. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code), and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). We make extensive use of named graphs in RDF to make the knowledge graph extensible by the community. We describe a set of generic extraction techniques that we applied to over 1.3M Python files drawn from GitHub, over 2,300 Python modules, as well as 47M forum posts to generate a graph with over 2 billion triples. We also provide a number of initial use cases of the knowledge graph in code assistance, enforcing best practices, debugging and type inference. The graph and all its artifacts are available to the community for use.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation

+

Rajas Agashe, Srinivasan Iyer, Luke Zettlemoyer. 2019

+

+ + [ArXiV] + + [Dataset] + + + +
+ + dataset + + bimodal + +

+

Interactive programming with interleaved code snippet cells and natural language markdown is recently gaining popularity in the form of Jupyter notebooks, which accelerate prototyping and collaboration. To study code generation conditioned on a long context history, we present JuICe, a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments. Compared with existing contextual code generation datasets, JuICe provides refined human-curated data, open-domain code, and an order of magnitude more training data. Using JuICe, we train models for two tasks: (1) generation of the API call sequence in a code cell, and (2) full code cell generation, both conditioned on the NL-Code history up to a particular code cell. Experiments using current baseline code generation models show that both context and distant supervision aid in generation, and that the dataset is challenging for current systems.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Using Machine Translation for Converting Python 2 to Python 3 Code

+

Karan Aggarwal, Mohammad Salameh, Abram Hindle. 2015

+

+ + + +
+ + migration + +

+

In this paper, we have tried to use Statistical machine translation in order to convert Python 2 code to Python 3 code. We use data from two projects and achieve a high BLEU score. We also investigate the cross-project training and testing to analyze the errors so as to ascertain differences with previous case. We have described a pilot study on modeling programming languages as natural language to build translation models on the lines of natural languages. This can be further worked on to translate between versions of a programming language or cross-programming-languages code translation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context

+

Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, Sriram Rajamani. NeurIPS 2023

+

+ + [ArXiV] + + [NeurIPS website] + + [code] + + + +
+ + autocomplete + + benchmark + + code completion + + code generation + + compilation + + completion + + dataset + + evaluation + + language model + + large language models + + program analysis + + static analysis + + tool + +

+

Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating.

+ +

Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model.

+ +

We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Transformer-based Approach for Source Code Summarization

+

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. ACL 2020

+

+ + [ArXiV] + + [Code] + + + +
+ + summarization + +

+

Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model that uses a self-attention mechanism and has shown to be effective in capturing long-range dependencies. In this work, we show that despite the approach is simple, it outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., the absolute encoding of source code tokens’ position hinders, while relative encoding significantly improves the summarization performance. We have made our code publicly available to facilitate future research.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Unified Pre-training for Program Understanding and Generation

+

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. NAACL 2021

+

+ + [ArXiV] + + + +
+ + pretraining + + Transformer + +

+

Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on language generation tasks, including code summarization, generation, translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Lenient Parsing & Typing via Indirect Supervision

+

Toufique Ahmed, Vincent Hellendoorn, Premkumar Devanbu. 2019

+

+ + [ArXiV] + + + +
+ + types + +

+

Both professional coders and teachers frequently deal with imperfect (fragmentary, incomplete, ill-formed) code. Such fragments are common in StackOverflow; students also frequently produce ill-formed code, for which instructors, TAs (or students themselves) must find repairs. In either case, the developer experience could be greatly improved if such code could somehow be parsed & typed; this makes them more amenable to use within IDEs and allows early detection and repair of potential errors. We introduce a lenient parser, which can parse & type fragments, even ones with simple errors. Training a machine learner to leniently parse & type imperfect code requires a large training set of pairs of imperfect code and its repair (and/or type information); such training sets are limited by human effort and curation. In this paper, we present a novel indirectly supervised approach to train a lenient parser, without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on Github, and the massive, efficient learning capacity of Transformer-based NN architectures. Using GitHub data, we first create a large dataset of fragments of code and corresponding tree fragments and type annotations; we then randomly corrupt the input fragments (while requiring correct output) by seeding errors that mimic corruptions found in StackOverflow and student data. Using this data, we train high-capacity transformer models to overcome both fragmentation and corruption. With this novel approach, we can achieve reasonable performance on parsing & typing StackOverflow fragments; we also demonstrate that our approach achieves best-in-class performance on a large dataset of student errors.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning code summarization from a small and local dataset

+

Toufique Ahmed, Premkumar Devanbu. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + summarization + +

+

Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary, and other phenomena vary substantially with each project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting) and a maximalist hybrid approach, fine-tuning first on many projects in many languages and then training on the same-project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art, on many different projects in both Java and Python.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Studying LLM Performance on Closed- and Open-source Data

+

Toufique Ahmed, Christian Bird, Premkumar Devanbu, Saikat Chakraborty. 2024

+

+ + [ArXiV] + + + +
+ + Transformers + +

+

Large Language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry, and are largely trained on open-source (OSS) code distributed with permissive licenses. In terms of actual use however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers, do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there work-arounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS –> proprietary code, but does significantly reduce for C++; we find that this difference is attributable to differences in identifiers. We also find that some performance degradation, in some cases, can be ameliorated efficiently by in-context learning.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Improving Few-Shot Prompts with Relevant Static Analysis Products

+

Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, Earl T. Barr. 2023

+

+ + [ArXiV] + + + +
+ + summarization + + Transformer + +

+

Large Language Models (LLM) are a new class of computation engines, “programmed” via prompt engineering. We are still learning how to best “program” these LLMs to help developers. We start with the intuition that developers tend to consciously and unconsciously have a collection of semantics facts in mind when working on coding tasks. Mostly these are shallow, simple facts arising from a quick read. For a function, examples of facts might include parameter and local variable names, return expressions, simple pre- and post-conditions, and basic control and data flow, etc.

+ +

One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them inherently capable of doing this simple level of “code analysis” and extracting such information, implicitly, while processing code: but are they, really? If they aren’t, could explicitly adding this information help? Our goal here is to investigate this question, using the code summarization task and evaluate whether automatically augmenting an LLM’s prompt with semantic facts explicitly, actually helps.

+ +

Prior work shows that LLM performance on code summarization benefits from few-shot samples drawn either from the same-project or from examples found via information retrieval methods (such as BM25). While summarization performance has steadily increased since the early days, there is still room for improvement: LLM performance on code summarization still lags its performance on natural-language tasks like translation and text summarization.

+ +

We find that adding semantic facts actually does help! This approach improves performance in several different settings suggested by prior work, including for two different Large Language Models. In most cases, improvement nears or exceeds 2 BLEU; for the PHP language in the challenging CodeSearchNet dataset, this augmentation actually yields performance surpassing 30 BLEU.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A large-scale benchmark for few-shot program induction and synthesis

+

Ferran Alet, Javier Lopez-Contreras, James Koppel, Maxwell Nye, Armando Solar-Lezama, Tomas Lozano-Perez, Leslie Kaelbling, Joshua Tenenbaum. ICML 2021

+

+ + [PMLR] + + [website] + + + +
+ + dataset + + synthesis + +

+

A landmark challenge for AI is to learn flexible, powerful representations from small numbers of examples. +On an important class of tasks, hypotheses in the form of programs provide extreme generalization capabilities from surprisingly few examples. However, whereas large natural few-shot learning image benchmarks have spurred progress in meta-learning for deep networks, there is no comparably big, natural program-synthesis dataset that can play a similar role. This is because, whereas images are relatively easy to label from internet meta-data or annotated by non-experts, generating meaningful input-output examples for program induction has proven hard to scale. In this work, we propose a new way of leveraging unit tests and natural inputs for small programs as meaningful input-output examples for each sub-program of the overall program. This allows us to create a large-scale naturalistic few-shot program-induction benchmark and propose new challenges in this domain. The evaluation of multiple program induction and synthesis algorithms points to shortcomings of current methods and suggests multiple avenues for future work.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SantaCoder: don’t reach for the stars!

+

Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muenninghoff, Mayank Mishra, Alex Gu, Manan Den, Longesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Terry Yue Zhuo, Francesco De Toni, Bernanrdo Garcia del Rio, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Michael Lappert, Ian Yu, Paulo Villegas, Jia Li, David Lansy, Huu Nguyen, Danish Contractor, Luis Villa, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Arjun Guha, Harm de Vries, Leonadro von Werra. 2022

+

+ + + +
+ + Transformer + +

+

The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code.1 This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) +redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, +JavaScript, and Python subsets of The Stack (Kocetkov et al., 2022) and +evaluate the models on MultiPL-E (Cassano et al., 2022), a text2code +benchmark available in 18 programming languages. We find that more +aggressive filtering of near-duplicates can further boost performance and, +surprisingly, that selecting files from repositories with 5+ GitHub stars +deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and +CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the +Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL +license at https://hf.co/bigcode

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Mining Source Code Repositories at Massive Scale Using Language Modeling

+

Miltiadis Allamanis, Charles Sutton. MSR 2013

+

+ + [PDF] + + [data] + + [data@ Edinburgh DataShare] + + + +
+ + language model + +

+

The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. The giga-token model is significantly better at the code suggestion task than previous models. More broadly, our approach provides a new “lens” for analyzing software projects, enabling new complexity metrics based on statistical analysis of large corpora. We call these metrics data-driven complexity metrics. We propose new metrics that measure the complexity of a code module and the topical centrality of a module to a software project. In particular, it is possible to distinguish reusable utility classes from classes that are part of a program’s core logic based solely on general information theoretic criteria.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Natural Coding Conventions

+

Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton. FSE 2014

+

+ + [PDF] + + [ArXiV] + + [website] + + [code] + + + +
+ + naming + + language model + + style + +

+

Every programmer has a characteristic style, ranging from preferences +about identifier naming to preferences about object relationships and +design patterns. Coding conventions define a consistent syntactic style, +fostering readability and hence maintainability. When collaborating, +programmers strive to obey a project’s coding conventions. However, +one third of reviews of changes contain feedback about coding conventions, +indicating that programmers do not always follow them and that project +members care deeply about adherence. Unfortunately, programmers are +often unaware of coding conventions because inferring them requires a +global view, one that aggregates the many local decisions programmers +make and identifies emergent consensus on style. We present Naturalize, +a framework that learns the style of a codebase, and suggests revisions +to improve stylistic consistency. Naturalize builds on recent work in +applying statistical natural language processing to source code. We +apply Naturalize to suggest natural identifier names and formatting +conventions. We present four tools focused on ensuring natural code +during development and release management, including code review. +Naturalize achieves 94% accuracy in its top suggestions for identifier +names. We used Naturalize to generate 18 patches for 5 open source +projects: 14 were accepted.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Mining Idioms from Source Code

+

Miltiadis Allamanis, Charles Sutton. FSE 2014

+

+ + [PDF] + + [ArXiV] + + [data] + + + +
+ + pattern mining + + grammar + + grammar + +

+

We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic purpose. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present Haggis, a system for mining code idioms that builds on recent advanced techniques from statistical natural language processing, namely, nonparametric Bayesian probabilistic tree substitution grammars. We apply Haggis to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicate that they describe important program concepts, including object creation, exception handling, and resource management.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Bimodal Modelling of Source Code and Natural Language

+

Miltiadis Allamanis, Daniel Tarlow, Andrew Gordon, Yi Wei. ICML 2015

+

+ + [Supplementary Material] + + [Presentation Video] + + + +
+ + search + + grammar + + grammar + + bimodal + +

+

We consider the problem of building probabilistic models that jointly +model short natural language utterances and source code snippets. The +aim is to bring together recent work on statistical modelling of source +code and work on bimodal models of images and natural language. The +resulting models are useful for a variety of tasks that involve natural +language and source code. We demonstrate their performance on two +retrieval tasks: retrieving source code snippets given a natural language +query, and retrieving natural language descriptions given a source code +query (i.e., source code captioning). Experiments show there to be +promise in this direction, and that modelling the structure of source +code improves performance.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Suggesting Accurate Method and Class Names

+

Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton. FSE 2015

+

+ + [PDF] + + [website] + + + +
+ + naming + +

+

Descriptive names are a vital part of readable, and hence maintainable, code. Recent progress on automatically suggesting names for local variables tantalizes with the prospect of replicating that success with method and class names. However, suggesting names for methods and classes is much more difficult. This is because good method and class names need to be functionally descriptive, but suggesting such names requires that the model goes beyond local context. We introduce a neural probabilistic language model for source code that is specifically designed for the method naming problem. Our model learns which names are semantically similar by assigning them to locations, called embeddings, in a high-dimensional continuous space, in such a way that names with similar embeddings tend to be used in similar contexts. These embeddings seem to contain semantic information about tokens, even though they are learned only from statistical co-occurrences of tokens. Furthermore, we introduce a variant of our model +that is, to our knowledge, the first that can propose neologisms, names that have not appeared in the training corpus. We obtain state of the art results on the method, class, and even the simpler variable naming tasks. More broadly, the continuous embeddings that are learned by our model have the potential for wide application within software engineering.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Convolutional Attention Network for Extreme Summarization of Source Code

+

Miltiadis Allamanis, Hao Peng, Charles Sutton. ICML 2016

+

+ + [website] + + [code] + + [proceedings] + + [presentation video] + + [GitXiV] + + + +
+ + naming + + summarization + +

+

Attention mechanisms in neural networks have proved useful for problems in which +the input and output do not have fixed dimension. Often there exist features that +are locally translation invariant and would be valuable for directing the model’s attention, +but previous attentional architectures are not constructed to learn such features specifically. +We introduce an attentional neural network that employs convolution on the input tokens to detect +local time-invariant and long-range topical attention features in a context-dependent way. We +apply this architecture to the problem of extreme summarization of source code snippets into short, +descriptive function name-like summaries. Using those features, the model sequentially generates a +summary by marginalizing over two attention mechanisms: one that predicts the next summary token based +n the attention weights of the input tokens and another that is able to copy a code token as-is directly +into the summary. We demonstrate our convolutional attention neural network’s performance on 10 popular Java +projects showing that it achieves better performance compared to previous attentional mechanisms.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Mining Semantic Loop Idioms from Big Code

+

Miltiadis Allamanis, Earl T. Barr, Christian Bird, Mark Marron, Charles Sutton. TSE 2017

+

+ + [MSR Technical Report] + + [website] + + + +
+ + pattern mining + + grammar + +

+

During maintenance, developers spend a lot of time transforming existing code: refactoring, optimizing, and adding checks to make it more robust. Much of this work is the drudgery of identifying and replacing specific patterns, yet it resists automation, because of meaningful patterns are hard to automatically find. We present a technique for mining loop idioms, surprisingly probable semantic patterns that occur in loops, from big code to find meaningful patterns. First, we show that automatically identifiable patterns exist, in great numbers, with a large scale empirical study of loop over 25 MLOC. We find that loops in this corpus are simple and predictable: 90% of them have fewer than 15LOC and 90% have no nesting and very simple control structure. Encouraged by this result, we coil loops to abstract away syntactic diversity to define information rich loop idioms. We show that only 50 loop idioms cover 50% of the concrete loops. We show how loop idioms can help a tool developers identify and prioritize refactorings. We also show how our framework opens the door to data-driven tool and language design discovering opportunities to introduce new API calls and language constructs: loop idioms show that LINQ would benefit from an Enumerate operator, a result confirmed by the fact that precisely this feature is one of the most requested features on StackOverflow with 197 votes and 95k views.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SmartPaste: Learning to Adapt Source Code

+

Miltiadis Allamanis, Marc Brockschmidt. 2017

+

+ + [ArXiV] + + + +
+ + representation + + variable misuse + +

+

Deep Neural Networks have been shown to succeed at a range of natural +language tasks such as machine translation and text summarization. +While tasks on source code (ie, formal languages) have been considered +recently, most work in this area does not attempt to capitalize on the +unique opportunities offered by its known syntax and structure. In this +work, we introduce SmartPaste, a first task that requires to use such +information. The task is a variant of the program repair problem that +requires to adapt a given (pasted) snippet of code to surrounding, +existing source code. As first solutions, we design a set of deep +neural models that learn to represent the context of each variable +location and variable usage in a data flow-sensitive way. Our +evaluation suggests that our models can learn to solve the SmartPaste +task in many cases, achieving 58.6% accuracy, while learning meaningful +representation of variable usages.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Represent Programs with Graphs

+

Miltiadis Allamanis, Marc Brockschmidt, Mahmoud Khademi. ICLR 2018

+

+ + [ArXiV] + + [GGNN Code] + + [Data] + + + +
+ + naming + + GNN + + representation + + variable misuse + + defect + +

+

Learning tasks on source code (i.e., formal languages) have been considered recently, but most work has tried to transfer natural language methods and does not capitalize on the unique opportunities offered by code’s known syntax. For example, long-range dependencies induced by using the same variable or function in distant locations are often not considered. We propose to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures.

+ +

In this work, we present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to such large graphs. We evaluate our method on two tasks: VarNaming, in which a network attempts to predict the name of a variable given its usage, and VarMisuse, in which the network learns to reason about selecting the correct variable that should be used at a given program location. Our comparison to methods that use less structured program representations shows the advantages of modeling known structure, and suggests that our models learn to infer meaningful names and to solve the VarMisuse task in many cases. Additionally, our testing showed that VarMisuse identifies a number of bugs in mature open-source projects.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

The Adverse Effects of Code Duplication in Machine Learning Models of Code

+

Miltiadis Allamanis. 2019

+

+ + [ArXiV] + + [Dataset Errata] + + [Tool] + + + +
+ + dataset + + evaluation + +

+

The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has been recently identified by Lopes et al. (2017) who found a large amount of code duplication on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this article, we study the effect of code duplication to machine learning models showing that reported metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machine learning models of code are used by software engineers. We present an “errata” for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them, and release tools to help the community avoid this problem in future research.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Typilus: Neural Type Hints

+

Miltiadis Allamanis, Earl T. Barr, Soline Ducousso, Zheng Gao. PLDI 2020

+

+ + [ArXiV] + + [Dataset] + + + +
+ + types + + GNN + +

+

Type inference over partial contexts in dynamically typed languages is challenging. In this work, we present a graph neural network model that predicts types by probabilistically reasoning over a program’s structure, names, and patterns. The network uses deep similarity learning to learn a TypeSpace – a continuous relaxation of the discrete space of types – and how to embed the type properties of a symbol (i.e. identifier) into it. Importantly, our model can employ one-shot learning to predict an open vocabulary of types, including rare and user-defined ones. We realise our approach in Typilus for Python that combines the TypeSpace with an optional type checker. We show that Typilus accurately predicts types. Typilus confidently predicts types for 70% of all annotatable symbols; when it predicts a type, that type optionally type checks 95% of the time. Typilus can also find incorrect type annotations; two important and popular open source libraries, fairseq and allennlp, accepted our pull requests that fixed the annotation errors Typilus discovered.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Self-Supervised Bug Detection and Repair

+

Miltiadis Allamanis, Henry Jackson-Flux, Marc Brockschmidt. NeurIPS 2021

+

+ + [ArXiV] + + + +
+ + GNN + + Transformer + + defect + + repair + +

+

Machine learning-based program analyses have recently shown the promise of integrating formal and probabilistic reasoning towards aiding software development. However, in the absence of large annotated corpora, training these analyses is challenging. Towards addressing this, we present BugLab, an approach for self-supervised learning of bug detection and repair. BugLab co-trains two models: (1) a detector model that learns to detect and repair bugs in code, (2) a selector model that learns to create buggy code for the detector to use as training data. A Python implementation of BugLab improves by up to 30% upon baseline methods on a test dataset of 2374 real-life bugs and finds 19 previously unknown bugs in open-source software.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

code2seq: Generating Sequences from Structured Representations of Code

+

Uri Alon, Omer Levy, Eran Yahav. ICLR 2019

+

+ + [ArXiV] + + + +
+ + naming + + summarization + + representation + +

+

The ability to generate natural language sequences from source code snippets has a variety of applications such as code summarization, documentation, and retrieval. Sequence-to-sequence (seq2seq) models, adopted from neural machine translation (NMT), have achieved state-of-the-art performance on these tasks by treating source code as a sequence of tokens. We present code2seq: an alternative approach that leverages the syntactic structure of programming languages to better encode source code. Our model represents a code snippet as the set of compositional paths in its abstract syntax tree (AST) and uses attention to select the relevant paths while decoding.

+ +

We demonstrate the effectiveness of our approach for two tasks, two programming languages, and four datasets of up to 16M examples. Our model significantly outperforms previous models that were specifically designed for programming languages, as well as general state-of-the-art NMT models. An interactive online demo of our model is available at http://code2seq.org.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A General Path-Based Representation for Predicting Program Properties

+

Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav. PLDI 2018

+

+ + [ArXiV] + + + +
+ + naming + + representation + +

+

Predicting program properties such as names or expression types has a wide range of applications. It can ease the task of programming and increase programmer productivity. A major challenge when learning from programs is how to represent programs in a way that facilitates effective learning. +We present a general path-based representation for learning from programs. Our representation is purely syntactic and extracted automatically. The main idea is to represent a program using paths in its abstract syntax tree (AST). This allows a learning model to leverage the structured nature of code rather than treating it as a flat sequence of tokens. +We show that this representation is general and can: (i) cover different prediction tasks, (ii) drive different learning algorithms (for both generative and discriminative models), and (iii) work across different programming languages. +We evaluate our approach on the tasks of predicting variable names, method names, and full types. We use our representation to drive both CRF-based and word2vec-based learning, for programs of four languages: JavaScript, Java, Python and C#. Our evaluation shows that our approach obtains better results than task-specific handcrafted representations across different tasks and programming languages.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

code2vec: Learning Distributed Representations of Code

+

Uri Alon, Omer Levy, Eran Yahav. POPL 2019

+

+ + [Code] + + + +
+ + naming + + summarization + + representation + +

+

We present a neural model for representing snippets of code as continuous distributed vectors (“code embeddings”). + The main idea is to represent a code snippet as a single fixed-length +code vector, which can be used to +predict semantic properties of the snippet. To this end, code is first decomposed to a collection of paths in its +abstract syntax tree. Then, the network learns the atomic representation of each path while +simultaneously +learning how to aggregate a set of them.

+ +

We demonstrate the effectiveness of our approach by using it to predict a method’s name from the vector +representation of its body. We evaluate our approach by training a model on a dataset of 12M methods. We +show that code vectors trained on this dataset can predict method names from files that were unobserved +during training. Furthermore, we show that our model learns useful method name vectors that capture +semantic similarities, combinations, and analogies.

+ +

A comparison of our approach to previous techniques over the same dataset shows an improvement of +more than 75%, making it the first to successfully predict method names based on a large, cross-project +corpus. Our trained model, visualizations and vector similarities are available as an interactive online demo at +http://code2vec.org. The code, data and trained models are available at +https://github.com/tech-srl/code2vec.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Structural Language Models for Any-Code Generation

+

Uri Alon, Roy Sadaka, Omer Levy, Eran Yahav. 2019

+

+ + [ArXiV] + + + +
+ + code generation + +

+

We address the problem of Any-Code Generation (AnyGen) - generating code without any restriction on the vocabulary or structure. The state-of-the-art in this problem is the sequence-to-sequence (seq2seq) approach, which treats code as a sequence and does not leverage any structural information. We introduce a new approach to AnyGen that leverages the strict syntax of programming languages to model a code snippet as a tree - structural language modeling (SLM). SLM estimates the probability of the program’s abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous structural techniques that have severely restricted the kinds of expressions that can be generated, our approach can generate arbitrary expressions in any programming language. Our model significantly outperforms both seq2seq and a variety of existing structured approaches in generating Java and C# code. We make our code, datasets, and models available online.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Attribute Machines for Program Generation

+

Matthew Amodio, Swarat Chaudhuri, Thomas W. Reps. 2017

+

+ + + +
+ + grammar + + code generation + + representation + +

+

Recurrent neural networks have achieved remarkable success at generating sequences with complex structures, thanks to advances that include richer embeddings of input and cures for vanishing gradients. Trained only on sequences from a known grammar, though, they can still struggle to learn rules and constraints of the grammar. Neural Attribute Machines (NAMs) are equipped with a logical machine that represents the underlying grammar, which is used to teach the constraints to the neural machine by (i) augmenting the input sequence, and (ii) optimizing a custom loss function. Unlike traditional RNNs, NAMs are exposed to the grammar, as well as samples from the language of the grammar. During generation, NAMs make significantly fewer violations of the constraints of the underlying grammar than RNNs trained only on samples from the language of the grammar.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Towards Learning Representations of Binary Executable Files for Security Tasks

+

Shushan Arakelyan, Sima Arasteh, Christophe Hauser, Erik Kline, Aram Galstyan. AAAI 2020

+

+ + [ArXiV] + + + +
+ + GNN + + representation + +

+

Tackling binary analysis problems has traditionally implied manually defining rules and heuristics. As an alternative, we are suggesting using machine learning models for learning distributed representations of binaries that can be applicable for a number of downstream tasks. We construct a computational graph from the binary executable and use it with a graph convolutional neural network to learn a high dimensional representation of the program. We show the versatility of this approach by using our representations to solve two semantically different binary analysis tasks – algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as published results and demonstrate improvement on the state of the art methods for both tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Predicting Vulnerability in Large Codebases With Deep Code Representation

+

Anshul Tanwar, Krishna Sundaresan, Parmesh Ashwath, Prasanna Ganesan, Sathish Kumar Chandrasekaran, Sriram Ravi. 2020

+

+ + [ArXiV] + + + +
+ + grammar + + program analysis + + static analysis + +

+

Currently, while software engineers write code for various modules, quite often, various types of errors - coding, logic, semantic, and others (most of which are not caught by compilation and other tools) get introduced. Some of these bugs might be found in the later stage of testing, and many times it is reported by customers on production code. Companies have to spend many resources, both money and time in finding and fixing the bugs which would have been avoided if coding was done right. Also, concealed flaws in software can lead to security vulnerabilities that potentially allow attackers to compromise systems and applications. Interestingly, same or similar issues/bugs, which were fixed in the past (although in different modules), tend to get introduced in production code again. +We developed a novel AI-based system which uses the deep representation of Abstract Syntax Tree (AST) created from the source code and also the active feedback loop to identify and alert the potential bugs that could be caused at the time of development itself i.e. as the developer is writing new code (logic and/or function). This tool integrated with IDE as a plugin would work in the background, point out existing similar functions/code-segments and any associated bugs in those functions. The tool would enable the developer to incorporate suggestions right at the time of development, rather than waiting for UT/QA/customer to raise a defect. +We assessed our tool on both open-source code and also on Cisco codebase for C and C++ programing language. Our results confirm that deep representation of source code and the active feedback loop is an assuring approach for predicting security and other vulnerabilities present in the code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Autocompletion from Real-World Datasets

+

Gareth Ari Aye, Seohyun Kim, Hongyu Li. 2020

+

+ + [ArXiV] + + + +
+ + autocomplete + +

+

Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in completion suggestion prediction on synthetic benchmarks. However, a recent study When Code Completion Fails: a Case Study on Real-World Completions demonstrates that these results may not translate to improvements in real-world performance. To combat this effect, we train models on real-world code completion examples and find that these models outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy respectively. We observe this improvement across modeling technologies and show through A/B testing that it corresponds to a 6.2% increase in programmers’ actual autocompletion usage. Furthermore, our study characterizes a large corpus of logged autocompletion usages to investigate why training on real-world examples leads to stronger models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Sequence Model Design for Code Completion in the Modern IDE

+

Gareth Ari Aye, Gail E. Kaiser. Optional 2020

+

+ + [ArXiV] + + + +
+ + autocomplete + +

+

Code completion plays a prominent role in modern integrated development environments (IDEs). Machine learning has become ubiquitous in analogous natural language writing and search software, surfacing more relevant autocompletions and search suggestions in fewer keystrokes. Prior research has reported training high-accuracy, deep neural networks for modeling source code, but little attention has been given to the practical constraints imposed by interactive developer tools. In particular, neural language models for source code modeling like the one described in Maybe Deep Neural Networks are the Best Choice for Modeling Source Code are framed around code completion, but only report accuracy of next-token prediction. However, in order for a language model (LM) to work well within real-world code completion systems, it must also always make suggestions that produce valid code that typechecks to support code completion’s role in correctness-checking; return instantaneous results to help programmers code more efficiently in fewer keystrokes; and be small enough to fit comfortably on disk and in memory on developer workstations, since virtually all modern IDEs run locally and support offline usage. To meet these additional requirements, we propose a novel design for predicting top-k next tokens that combines static analysis’ ability to enumerate all valid keywords and in-scope identifiers with the ability of a language model to place a probability distribution over them. Our model mixes character-level input representation with token output to represent out-of-vocabulary (OOV) tokens meaningfully and minimize prediction latency. OOV tokens can be predicted through detection of local repetition common in software. This design achieves state-of-art accuracy in source code modeling and fits the constraints imposed by real-world code completion implementations in modern IDEs.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Jointly Learning to Repair Code and Generate Commit Message

+

Jiaqi Bai, Long Zhou, Ambrosio Blanco, Shujie Liu, Furu Wei, Ming Zhou, Zhoujun Li. 2021

+

+ + [ArXiV] + + + +
+ + edit + + Transformer + +

+

We propose a novel task of jointly repairing program codes and generating commit messages. Code repair and commit message generation are two essential and related tasks for software development. However, existing work usually performs the two tasks independently. We construct a multilingual triple dataset including buggy code, fixed code, and commit messages for this novel task. We provide the cascaded models as baseline, which are enhanced with different training approaches, including the teacher-student method, the multi-task method, and the back-translation method. To deal with the error propagation problem of the cascaded method, the joint model is proposed that can both repair the code and generate the commit message in a unified framework. Experimental results show that the enhanced cascaded model with teacher-student method and multitask-learning method achieves the best score on different metrics of automated code repair, and the joint model behaves better than the cascaded model on commit message generation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code Mapping in Heterogeneous Platforms Using Deep Learning and LLVM-IR

+

Francesco Barchi, Gianvito Urgese, Enrico Macii, Andrea Acquaviva. DAC 2019

+

+ + [ACM] + + [code] + + + +
+ + optimization + + program analysis + + static analysis + + natural language processing + +

+

Modern heterogeneous platforms require compilers capable of choosing the appropriate device for the execution of program portions. This paper presents a machine learning method designed for supporting mapping decisions through the analysis of the program source code represented in LLVM assembly language (IR) for exploiting the advantages offered by this generalised and optimised representation. To evaluate our solution, we trained an LSTM neural network on OpenCL kernels compiled in LLVM-IR and processed with our tokenizer capable of filtering less-informative tokens. We tested the network that reaches an accuracy of 85% in distinguishing the best computational unit.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Exploration of Convolutional Neural Network models for source code classification

+

Francesco Barchi, Emanuele Parisi, Gianvito Urgese, Elisa Ficarra, Andrea Acquaviva. Engineering Applications of Artificial Intelligence 2021

+

+ + [ScienceDirect] + + [code] + + + +
+ + optimization + + static analysis + + program analysis + + language model + +

+

The application of Artificial Intelligence is becoming common in many engineering fields. Among them, one of the newest and rapidly evolving is software generation, where AI can be used to automatically optimise the implementation of an algorithm for a given computing platform. In particular, Deep Learning technologies can be used to the decide how to allocate pieces of code to hardware platforms with multiple cores and accelerators, that are common in high performance and edge computing applications. In this work, we explore the use of Convolutional Neural Networks (CNN)s to analyse the application source code and decide the best compute unit to minimise the execution time. We demonstrate that CNN models can be successfully applied to source code classification, providing higher accuracy with consistently reduced learning time with respect to state-of-the-art methods. Moreover, we show the robustness of the method with respect to source code pre-processing, compiler options and hyper-parameters selection.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities

+

Francesco Barchi, Emanuele Parisi, Andrea Bartolini, Andrea Acquaviva. Journal of Low Power Electronics and Applications 2022

+

+ + [MDPI] + + + +
+ + optimization + + review + +

+

To cope with the increasing complexity of digital systems programming, deep learning techniques have recently been proposed to enhance software deployment by analysing source code for different purposes, ranging from performance and energy improvement to debugging and security assessment. As embedded platforms for cyber-physical systems are characterised by increasing heterogeneity and parallelism, one of the most challenging and specific problems is efficiently allocating computational kernels to available hardware resources. In this field, deep learning applied to source code can be a key enabler to face this complexity. However, due to the rapid development of such techniques, it is not easy to understand which of those are suitable and most promising for this class of systems. For this purpose, we discuss recent developments in deep learning for source code analysis, and focus on techniques for kernel mapping on heterogeneous platforms, highlighting recent results, challenges and opportunities for their applications to cyber-physical systems.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code

+

Patrick Bareiß, Beatriz Souza, Marcelo d'Amorim, Michael Pradel. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises the question whether they could serve as a basis for building a wide range code generation tools. Traditionally, such tools are built manually and separately for each task. Instead, few-shot learning may allow to obtain different tools from a single pre-trained language model by simply providing a few examples or a natural language description of the expected tool behavior. This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose. We consider three code manipulation and code generation tasks targeted by a range of traditional tools: (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation. For each task, we compare few-shot learning to a manually built tool. Our results show that the model-based tools complement (code mutation), are on par (test oracle generation), or even outperform their respective traditionally built tool (test case generation), while imposing far less effort to develop them. By comparing the effectiveness of different variants of the model-based tools, we provide insights on how to design an appropriate input (“prompt”) to the model and what influence the size of the model has. For example, we find that providing a small natural language description of the code generation task is an easy way to improve predictions. Overall, we conclude that few-shot language models are surprisingly effective, yet there is still more work to be done, such as exploring more diverse ways of prompting and tackling even more involved tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Grounded Copilot: How Programmers Interact with Code-Generating Models

+

Shraddha Barke, Michael B. James, Nadia Polikarpova. 2022

+

+ + [ArXiV] + + + +
+ + human evaluation + + synthesis + +

+

Powered by recent advances in code-generating models, AI assistants like Github Copilot promise to change the face of programming forever. But what is this new face of programming? We present the first grounded theory analysis of how programmers interact with Copilot, based on observing 20 participants–with a range of prior experience using the assistant–as they solve diverse programming tasks across four languages. Our main finding is that interactions with programming assistants are bimodal: in acceleration mode, the programmer knows what to do next and uses Copilot to get there faster; in exploration mode, the programmer is unsure how to proceed and uses Copilot to explore their options. Based on our theory, we provide recommendations for improving the usability of future AI programming assistants.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A parallel corpus of Python functions and documentation strings for automated code documentation and code generation

+

Antonio Valerio Miceli Barone, Rico Sennrich. 2017

+

+ + [ArXiV] + + [code] + + + +
+ + documentation + + summarization + + dataset + +

+

Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains.

+ +

In this work we introduce a large and diverse parallel corpus of a hundred thousands Python functions with their documentation strings (“docstrings”) generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with +data augmentation techniques to further increase the amount of training data.

+ +

We release our datasets and processing scripts in order to stimulate research in these areas.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Efficient Training of Language Models to Fill in the Middle

+

Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, Mark Chen. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + language model + +

+

We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts

+

Rohan Bavishi, Michael Pradel, Koushik Sen. 2017

+

+ + [ArXiV] + + + +
+ + naming + +

+

Most of the JavaScript code deployed in the wild has been minified, a process in which identifier names are replaced +with short, arbitrary and meaningless names. Minified code occupies less space, but also makes the code extremely difficult to manually inspect and understand. This paper presents Context2Name, a deep learning-based technique that partially reverses the effect of minification by predicting natural +identifier names for minified names. The core idea is to predict from the usage context of a variable a name that captures +the meaning of the variable. The approach combines a lightweight, token-based static analysis with an auto-encoder +neural network that summarizes usage contexts and a recurrent neural network that predict natural names for a given +usage context. We evaluate Context2Name +with a large corpus of real-world JavaScript code and show that it successfully predicts 60.4% of all minified identifiers. A comparison +with the state-of-the-art tools JSNice and JSNaughty shows +that our approach predicts 17% and 43% more names than the +best existing approaches, while taking only 2.6 milliseconds +to predict a name, on average.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

AutoPandas: neural-backed generators for program synthesis

+

Rohan Bavishi, Caroline Lemieux, Roy Fox, Koushik Sen, Ion Stoica. OOPSLA 2019

+

+ + + +
+ + synthesis + + GNN + + API + +

+

Developers nowadays have to contend with a growing number of APIs. While in the long-term they are very useful to developers, many modern APIs have an incredibly steep learning curve, due to their hundreds of functions handling many arguments, obscure documentation, and frequently changing semantics. For APIs that perform data transformations, novices can often provide an I/O example demonstrating the desired transformation, but may be stuck on how to translate it to the API. A programming-by-example synthesis engine that takes such I/O examples and directly produces programs in the target API could help such novices. Such an engine presents unique challenges due to the breadth of real-world APIs, and the often-complex constraints over function arguments. We present a generator-based synthesis approach to contend with these problems. This approach uses a program candidate generator, which encodes basic constraints on the space of programs. We introduce neural-backed operators which can be seamlessly integrated into the program generator. To improve the efficiency of the search, we simply use these operators at non-deterministic decision points, instead of relying on domain-specific heuristics. We implement this technique for the Python pandas library in AutoPandas. AutoPandas supports 119 pandas dataframe transformation functions. We evaluate AutoPandas on 26 real-world benchmarks and find it solves 17 of them.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

pix2code: Generating Code from a Graphical User Interface Screenshot

+

Tony Beltramelli. 2017

+

+ + [ArXiV] + + + +
+ + code generation + + bimodal + +

+

Transforming a graphical user interface screenshot created by a designer into computer code is a typical task conducted by a developer in order to build customized software, websites and mobile applications. In this paper, we show that Deep Learning techniques can be leveraged to automatically generate code given a graphical user interface screenshot as input. Our model is able to generate code targeting three different platforms (i.e. iOS, Android and web-based technologies) from a single input image with over 77% of accuracy.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Code Comprehension: A Learnable Representation of Code Semantics

+

Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler. NeurIPS 2018

+

+ + + +
+ + representation + +

+

With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that with a single RNN architecture and pre-trained fixed embeddings, inst2vec outperforms specialized approaches for performance prediction (compute device mapping, optimal thread coarsening); and algorithm classification from raw code (104 classes), where we set a new state-of-the-art.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer

+

Berkay Berabi, Jingxuan He, Veselin Raychev, Martin Vechev. ICML 2021

+

+ + [Code & Dataset] + + + +
+ + repair + +

+

The problem of fixing errors in programs has attracted substantial interest over the years. The +key challenge for building an effective code fixing tool is to capture a wide range of errors and +meanwhile maintain high accuracy. In this paper, we address this challenge and present a new +learning-based system, called TFix. TFix works +directly on program text and phrases the problem of code fixing as a text-to-text task. In turn, +this enables it to leverage a powerful Transformer +based model pre-trained on natural language and +fine-tuned to generate code fixes (via a large, high-quality dataset obtained from GitHub commits). +TFix is not specific to a particular programming +language or class of defects and, in fact, improved +its precision by simultaneously fine-tuning on 52 +different error types reported by a popular static +analyzer. Our evaluation on a massive dataset of +JavaScript programs shows that TFix is practically +effective: it is able to synthesize code that fixes +the error in ∼67 percent of cases and significantly +outperforms existing learning-based approaches.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models

+

Berkay Berabi, Alexey Gronskiy, Veselin Raychev, Gishor Sivanrupan, Victor Chibotaru, Martin Vechev. 2024

+

+ + [ArXiV] + + + +
+ + repair + + vulnerability + +

+

The automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction to solve this challenge is by leveraging large language models (LLMs), which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving code-repair task. We show that the task is difficult as it requires the model to learn long-range code relationships, a task that inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset for complex program bugs and their corresponding fixes is non-trivial. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs. The idea is to use program analysis to limit the LLM’s attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context - and use that instead. Our evaluation shows that this code reduction approach substantially improves available models such as GPT-4 using few-shot learning, as well as fine-tuning models. To train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), requiring complex interprocedural dataflow to discover. Our best system with Mixtral-8x7B can remove more than 80% of the reported defects while exactly matching the human fix in between 10 and 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4, or based on window-based models like TFix.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks

+

Sahil Bhatia, Rishabh Singh. 2016

+

+ + [ArXiV] + + + +
+ + repair + +

+

We present a method for automatically generating repair feedback for syntax errors for introductory programming problems. Syntax errors constitute one of the largest classes of errors (34%) in our dataset of student submissions obtained from a MOOC course on edX. The previous techniques for generating automated feedback on programming assignments have focused on functional correctness and style considerations of student programs. These techniques analyze the program AST of the program and then perform some dynamic and symbolic analyses to compute repair feedback. Unfortunately, it is not possible to generate ASTs for student programs with syntax errors and therefore the previous feedback techniques are not applicable in repairing syntax errors. We present a technique for providing feedback on syntax errors that uses Recurrent neural networks (RNNs) to model syntactically valid token sequences. Our approach is inspired from the recent work on learning language models from Big Code (large code corpus). For a given programming assignment, we first learn an RNN to model all valid token sequences using the set of syntactically correct student submissions. Then, for a student submission with +syntax errors, we query the learnt RNN model with the prefix token sequence to predict token sequences that can fix the error by either replacing or inserting the predicted token sequence at the error location. We evaluate our technique on over 14, 000 student submissions with syntax errors. Our technique can completely repair 31.69% (4501/14203) of submissions with syntax errors and in addition partially correct 6.39% (908/14203) of the submissions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neuro-symbolic program corrector for introductory programming assignments

+

Sahil Bhatia, Pushmeet Kohli, Rishabh Singh. ICSE 2018

+

+ + + +
+ + repair + +

+

Automatic correction of programs is a challenging problem with numerous real world applications in security, verification, and education. One application that is becoming increasingly important is the correction of student submissions in online courses for providing feedback. Most existing program repair techniques analyze Abstract Syntax Trees (ASTs) of programs, which are unfortunately unavailable for programs with syntax errors. In this paper, we propose a novel Neuro-symbolic approach that combines neural networks with constraint-based reasoning. Specifically, our method first uses a Recurrent Neural Network (RNN) to perform syntax repairs for the buggy programs; subsequently, the resulting syntactically-fixed programs are repaired using constraint-based techniques to ensure functional correctness. The RNNs are trained using a corpus of syntactically correct submissions for a given programming assignment, and are then queried to fix syntax errors in an incorrect programming submission by replacing or inserting the predicted tokens at the error location. We evaluate our technique on a dataset comprising of over 14,500 student submissions with syntax errors. Our method is able to repair syntax errors in 60% (8689) of submissions, and finds functionally correct repairs for 23.8% (3455) submissions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Python Code Suggestion with a Sparse Pointer Network

+

Avishkar Bhoopchand, Tim Rocktaschel, Earl Barr, Sebastian Riedel. 2016

+

+ + [ArXiV] + + + +
+ + language model + + autocomplete + +

+

To enhance developer productivity, all modern integrated development environments (IDEs) include code suggestion functionality that proposes likely next tokens at the cursor. While current IDEs work well for statically-typed languages, their reliance on type annotations means that they do not provide the same level of support for dynamic programming languages as for statically-typed languages. Moreover, suggestion engines in modern IDEs do not propose expressions or multi-statement idiomatic code. Recent work has shown that language models can improve code suggestion systems by learning from software repositories. This paper introduces a neural language model with a sparse pointer network aimed at capturing very long-range dependencies. We release a large-scale code suggestion corpus of 41M lines of Python code crawled from GitHub. On this corpus, we found standard neural language models to perform well at suggesting local phenomena, but struggle to refer to identifiers that are introduced many tokens in the past. By augmenting a neural language model with a pointer network specialized in referring to predefined classes of identifiers, we obtain a much lower perplexity and a 5 percentage points increase in accuracy for code suggestion compared to an LSTM baseline. In fact, this increase in code suggestion accuracy is due to a 13 times more accurate prediction of identifiers. Furthermore, a qualitative analysis shows this model indeed captures interesting long-range dependencies, like referring to a class member defined over 60 tokens in the past.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SinkFinder: harvesting hundreds of unknown interesting function pairs with just one seed

+

Pan Bian, Bin Liang, Jianjun Huang, Wenchang Shi, Xidong Wang, Jian Zhang. FSE 2020

+

+ + + +
+ + program analysis + +

+

Mastering the knowledge about security-sensitive functions that can potentially result in bugs is valuable to detect them. However, identifying this kind of functions is not a trivial task. Introducing machine learning-based techniques to do the task is a natural choice. Unfortunately, the approach also requires considerable prior knowledge, e.g., sufficient labelled training samples. In practice, the requirement is often hard to meet.

+ +

In this paper, to solve the problem, we propose a novel and practical method called SinkFinder to automatically discover function pairs that we are interested in, which only requires very limited prior knowledge. SinkFinder first takes just one pair of well-known interesting functions as the initial seed to infer enough positive and negative training samples by means of sub-word word embedding. By using these samples, a support vector machine classifier is trained to identify more interesting function pairs. Finally, checkers equipped with the obtained knowledge can be easily developed to detect bugs in target systems. The experiments demonstrate that SinkFinder can successfully discover hundreds of interesting functions and detect dozens of previously unknown bugs from large-scale systems, such as Linux, OpenSSL and PostgreSQL.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs

+

Vitaliy Bibaev, Alexey Kalina, Vadim Lomshakov, Yaroslav Golubev, Alexander Bezzubov, Nikita Povarov, Timofey Bryksin. ESEC/FSE 2022

+

+ + [ArXiV] + + + +
+ + autocomplete + +

+

We propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine learning based model for ranking completion candidates. +We developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs. +We used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model. +Then, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE. +Our evaluation shows that using a simple ranking model trained on the past user behavior logs significantly improved code completion experience. +Compared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832. +The approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client’s side. +Importantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models - this way, we have been using it in production since the end of 2020.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Statistical Deobfuscation of Android Applications

+

Benjamin Bichsel, Veselin Raychev, Petar Tsankov, Martin Vechev. CCS 2016

+

+ + + +
+ + deobfuscation + + naming + +

+

This work presents a new approach for deobfuscating Android APKs based on probabilistic learning of large code bases (termed “Big Code”). The key idea is to learn a probabilistic model over thousands of non-obfuscated Android applications and to use this probabilistic model to deobfuscate new, unseen Android APKs. The concrete focus of the paper is on reversing layout obfuscation, a popular transformation which renames key program elements such as classes, packages, and methods, thus making it difficult to understand what the program does. Concretely, the paper: (i) phrases the layout deobfuscation problem of Android APKs as structured prediction in a probabilistic graphical model, (ii) instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy, and (iii) shows how to leverage powerful inference and learning algorithms to achieve overall precision and scalability of the probabilistic predictions.

+ +

We implemented our approach in a tool called DeGuard and used it to: (i) reverse the layout obfuscation performed by the popular ProGuard system on benign, open-source applications, (ii) predict third-party libraries imported by benign APKs (also obfuscated by ProGuard), and (iii) rename obfuscated program elements of Android malware. The experimental results indicate that DeGuard is practically effective: it recovers 79.1% of the program element names obfuscated with ProGuard, it predicts third-party libraries with accuracy of 91.3%, and it reveals string decoders and classes that handle sensitive data in Android malware.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks

+

David Bieber, Charles Sutton, Hugo Larochelle, Daniel Tarlow. NeurIPS 2020

+

+ + [ArXiV] + + + +
+ + representation + + dynamic + +

+

Graph neural networks (GNNs) have emerged as a powerful tool for learning software engineering tasks including code completion, bug finding, and program repair. They benefit from leveraging program structure like control flow graphs, but they are not well-suited to tasks like program execution that require far more sequential reasoning steps than number of GNN propagation steps. Recurrent neural networks (RNNs), on the other hand, are well-suited to long sequential chains of reasoning, but they do not naturally incorporate program structure and generally perform worse on the above tasks. Our aim is to achieve the best of both worlds, and we do so by introducing a novel GNN architecture, the Instruction Pointer Attention Graph Neural Networks (IPA-GNN), which achieves improved systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by considering RNNs operating on program traces with branch decisions as latent variables. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution. To test the models, we propose evaluating systematic generalization on learning to execute using control flow graphs, which tests sequential reasoning and use of program structure. More practically, we evaluate these models on the task of learning to execute partial programs, as might arise if using the model as a heuristic function in program synthesis. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Static Prediction of Runtime Errors by Learning to Execute Programs with External Resource Descriptions

+

David Bieber, Rishab Goel, Daniel Zheng, Hugo Larochelle, Daniel Tarlow. 2022

+

+ + [ArXiV] + + [Dataset] + + + +
+ + dataset + + defect + +

+

The execution behavior of a program often depends on external resources, such as program inputs or file contents, and so cannot be run in isolation. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors as early as possible, even before programs can be compiled and run. This presents an interesting machine learning challenge: can we predict runtime errors in a “static” setting, where program execution is not possible? Here, we introduce a real-world dataset and task for predicting runtime errors, which we show is difficult for generic models like Transformers. We approach this task by developing an interpreter-inspired architecture with an inductive bias towards mimicking program executions, which models exception handling and “learns to execute” descriptions of the contents of external resources. Surprisingly, we show that the model can also predict the location of the error, despite being trained only on labels indicating the presence/absence and kind of error. In total, we present a practical and difficult-yet-approachable challenge problem related to learning program execution and we demonstrate promising new capabilities of interpreter-inspired machine learning models for code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

PHOG: Probabilistic Model for Code

+

Pavol Bielik, Veselin Raychev, Martin Vechev. ICML 2016

+

+ + + +
+ + grammar + + code generation + + language model + +

+

We introduce a new generative model for code called probabilistic higher order grammar (PHOG). PHOG generalizes probabilistic context free grammars (PCFGs) by allowing conditioning of a production rule beyond the parent non-terminal, thus capturing rich contexts relevant to programs. Even though PHOG is more powerful than a PCFG, it can be learned from data just as efficiently. We trained a PHOG model on a large JavaScript code corpus and show that it is more precise than existing models, while similarly fast. As a result, PHOG can immediately benefit existing programming tools based on probabilistic models of code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Adversarial Robustness for Code

+

Pavol Bielik, Martin Vechev. 2020

+

+ + [ArXiV] + + + +
+ + adversarial + + types + +

+

We propose a novel technique which addresses the challenge of learning accurate and robust models of code in a principled way. Our method consists of three key components: (i) learning to abstain from making a prediction if uncertain, (ii) adversarial training, and (iii) representation refinement which learns the program parts relevant for the prediction and abstracts the rest. These components are used to iteratively train multiple models, each of which learns a suitable program representation necessary to make robust predictions on a different subset of the dataset. We instantiated our approach to the task of type inference for dynamically typed languages and demonstrate its effectiveness by learning a model that achieves 88% accuracy and 84% robustness. Further, our evaluation shows that using the combination of all three components is key to obtaining accurate and robust models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

TraceFixer: Execution Trace-Driven Program Repair

+

Islem Bouzenia, Yangruibo Ding, Kexin Pei, Baishakhi Ray, Michael Pradel. 2023

+

+ + [ArXiV] + + + +
+ + Transformer + + repair + + dynamic + +

+

When debugging unintended program behavior, developers can often identify the point in the execution where the actual behavior diverges from the desired behavior. For example, a variable may get assigned a wrong value, which then negatively influences the remaining computation. Once a developer identifies such a divergence, how to fix the code so that it provides the desired behavior? This paper presents TraceFixer, a technique for predicting how to edit source code so that it does not diverge from the expected behavior anymore. The key idea is to train a neural program repair model that not only learns from source code edits but also exploits excerpts of runtime traces. The input to the model is a partial execution trace of the incorrect code, which can be obtained automatically through code instrumentation, and the correct state that the program should reach at the divergence point, which the user provides, e.g., in an interactive debugger. Our approach fundamentally differs from current program repair techniques, which share a similar goal but exploit neither execution traces nor information about the desired program state. We evaluate TraceFixer on single-line mistakes in Python code. After training the model on hundreds of thousands of code edits created by a neural model that mimics real-world bugs, we find that exploiting execution traces improves the bug-fixing ability by 13% to 20% (depending on the dataset, within the top-10 predictions) compared to a baseline that learns from source code edits only. Applying TraceFixer to 20 real-world Python bugs shows that the approach successfully fixes 10 of them.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

+

Islem Bouzenia, Premkumar Devanbu, Michael Pradel. 2024

+

+ + [ArXiV] + + + +
+ + repair + +

+

Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent’s effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI’s GPT-3.5 model, translates to 14 cents of USD per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Can Large Language Model Detect Plagiarism in Source Code?

+

William Brach, Kristián Košťál, Michal Ries. FLLM 2024

+

+ + [IEEE] + + [website] + + [code] + + + +
+ + code similarity + + large language models + + LLM + + plagiarism detection + + natural language processing + +

+

The issue of code plagiarism represents a significant challenge in the academic environment. This study examines the potential of large language models (LLMs) in improving the detection of code plagiarism. The performance of several LLMs, including GPT-4o, GPT3.5 Turbo, LLaMA 3, and CodeLlama, is evaluated in comparison to conventional tools, such as JPlag, across a range of levels of code plagiarism. The findings of our study illustrate that state-of-the-art LLMs are able to outperform traditional methods, particularly in the detection of sophisticated forms of plagiarism. GPT-4o exhibited the highest overall accuracy (78.70%) and an F1 score of 86.97%. It is important to note that open-source models, such as LLaMA 3 (accuracy 71.53%, F1 score 82.75%), demonstrated the ability to detect the most complex forms of plagiarism with the same accuracy as GPT-4o. While these results demonstrate the promising potential of LLMs in code similarity analysis, it is also evident that higher false positive rates may be an inherent limitation, emphasizing the need for human oversight. This study contributes valuable insights into the application of AI in maintaining code integrity and academic honesty, paving the way for more effective, interpretable, and fair plagiarism detection systems in software development education and practice.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Compiler-based graph representations for deep learning models of code

+

Alexander Brauckmann, Andres Goens, Sebastian Ertel, Jeronimo Castrillon. CC 2020

+

+ + [ACM] + + + +
+ + representation + + compilation + + optimization + + GNN + +

+

In natural language processing, novel methods in deep learning, like recurrent neural networks (RNNs) on sequences of words, have been very successful. These methods have also been used recently for tasks in compiler optimization, like heterogeneous mapping of OpenCL kernels or predicting thread coarsening factors for optimal execution times. In contrast to natural languages, programming languages usually have a well-defined structure. This structure is what enables compilers to reason about programs on the foundations of graphs, such as abstract syntax trees (ASTs) or control-data flow graphs (CDFGs). +In this paper, we argue that we should use these graph structures instead of word sequences for learning compiler optimization tasks. To this end we apply recently proposed graph neural networks (GNNs) for learning predictive compiler tasks on two representations based on ASTs and CDFGs. Experimental results show how these representations improve upon the accuracy of the state-of-the-art in the task of heterogeneous OpenCL mapping, while providing orders of magnitude faster inference times, which are crucial for compiler optimizations. When testing on benchmark suites not included for training, our graph-based methods significantly outperform the state-of-the art by 12 percentage points in terms of accuracy, and are the only ones to perform better than a random mapping. When testing on the task of predicting thread coarsening factors, we expose current limitations of deep learning in compilers. We show how all of the deep learning approaches proposed so far, including our graph-based models, fail to produce an overall speedup with their predictions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

ComPy-Learn: A toolbox for exploring machine learning representations for compilers

+

Alexander Brauckmann, Andrés Goens, Jeronimo Castrillon. FDL 2020

+

+ + [IEEE] + + [Code] + + + +
+ + representation + + compilation + + optimization + + GNN + +

+

Deep Learning methods have not only shown to improve software performance in compiler heuristics, but also e.g. to improve security in vulnerability prediction or to boost developer productivity in software engineering tools. A key to the success of such methods across these use cases is the expressiveness of the representation used to abstract from the program code. Recent work has shown that different such representations have unique advantages in terms of performance. However, determining the best-performing one for a given task is often not obvious and requires empirical evaluation. +Therefore, we present ComPy-Learn, a toolbox for conveniently defining, extracting, and exploring representations of program code. With syntax-level language information from the Clang compiler frontend and low-level information from the LLVM compiler backend, the tool supports the construction of linear and graph representations and enables an efficient search for the best-performing representation and model for tasks on program code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

OffSide: Learning to Identify Mistakes in Boundary Conditions

+

Jón Arnar Briem, Jordi Smit, Hendrig Sellik, Pavel Rapoport, Georgios Gousios, Maurício Aniche.. 2nd Workshop on Testing for Deep Learning and Deep Learning for Testing 2020

+

+ + [Preprint] + + + +
+ + defect + +

+

Mistakes in boundary conditions are the cause of many bugs in software. +These mistakes happen when, e.g., developers make use of < or > in cases +where they should have used <= or >=. Mistakes in boundary conditions +are often hard to find and manually detecting them might be very time-consuming +for developers. While researchers have been proposing techniques to cope with +mistakes in the boundaries for a long time, the automated detection of such bugs still +remains a challenge. We conjecture that, for a tool to be able to precisely identify mistakes +in boundary conditions, it should be able to capture the overall context of the source code +under analysis. In this work, we propose a deep learning model that learn mistakes in boundary +conditions and, later, is able to identifythem in unseen code snippets. We train and test a +model on over 1.5 million code snippets, with and without mistakes in different boundary conditions. +Our model shows an accuracy from 55% up to 87%. The model is also able to detect 24 out of 41 +real-world bugs;however, with a high false positive rate. The existing state-of-the-practice linter +tools are not able to detect any of the bugs. We hope this paper can pave the road towards deep +learning models that will be able to support developers in detecting mistakes in boundary conditions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Generative Code Modeling with Graphs

+

Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, Oleksandr Polozov. ICLR 2019

+

+ + [ArXiV] + + [OpenReview] + + [Code] + + + +
+ + grammar + + code generation + + GNN + +

+

Generative models forsource code are an interesting structured prediction problem, requiring to reason about both hard syntactic and semantic constraints as well as about natural, likely programs. We present a novel model for this problem that uses a graph to represent the intermediate state of the generated output. Our model generates code by interleaving grammar-driven expansion steps with graph augmentation and neural message passing steps. An experimental evaluation shows that our new model can generate semantically meaningful expressions, outperforming a range of strong baselines.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Structural Model for Contextual Code Changes

+

Shaked Brody, Uri Alon, Eran Yahav. OOPSLA 2020

+

+ + [ArXiV] + + [Code] + + + +
+ + edit + + grammar + + autocomplete + +

+

We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EditCompletion task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows us to model the likelihood of the edit itself, rather than learning the likelihood of the edited code. We represent an edit operation as a path in the program’s Abstract Syntax Tree (AST), originating from the source of the edit to the target of the edit. Using this representation, we present a powerful and lightweight neural model for the EditCompletion task. We conduct a thorough evaluation, comparing our approach to a variety of representation and modeling approaches that are driven by multiple strong models such as LSTMs, Transformers, and neural CRFs. Our experiments show that our model achieves 28% relative gain over state-of-the-art sequential models and 2× higher accuracy than syntactic models that learn to generate the edited code instead of modeling the edits directly. Our code, dataset, and trained models are publicly available at https://github.com/tech-srl/c3po/ .

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning from Examples to Improve Code Completion Systems

+

Marcel Bruch, Martin Monperrus, Mira Mezini.. ESEC/FSE 2009

+

+ + + +
+ + autocomplete + +

+

The suggestions made by current IDE’s code completion features are based exclusively on static type system of the programming language. As a result, often proposals are made which are irrelevant for a particular working context. Also, these suggestions are ordered alphabetically rather than by their relevance in a particular context. In this paper, we present intelligent code completion systems that learn from existing code repositories. We have implemented three such systems, each using the information contained in +repositories in a different way. We perform a large-scale quantitative evaluation of these systems, integrate the best performing one into Eclipse, and evaluate the latter also by a user study. Our experiments give evidence that intelligent code completion systems which learn from examples significantly outperform mainstream code completion systems in terms of the relevance of their suggestions and thus have the potential to enhance developers’ productivity.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning-based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection

+

Lutz Büch, Artur Andrzejak. SANER 2019

+

+ + [IEEEexplore] + + [website_pdf] + + [TR] + + + +
+ + grammar + + grammar + + clone + +

+

Code clone detection remains a crucial challenge in maintaining software projects. Many classic approaches rely on handcrafted aggregation schemes, while recent work uses supervised or unsupervised learning. In this work, we study several aspects of aggregation schemes for code clone detection based on supervised learning. To this aim, we implement an AST-based Recursive Neural Network. Firstly, our ablation study shows the influence of model choices and hyperparameters. We introduce error scaling as a way to effectively and efficiently address the class imbalance problem arising in code clone detection. Secondly, we study the influence of pretrained embeddings representing nodes in ASTs. We show that simply averaging all node vectors of a given AST yields strong baseline aggregation scheme. Further, learned AST aggregation schemes greatly benefit from pretrained node embeddings. Finally, we show the importance of carefully separating training and test data by clone clusters, to reliably measure generalization of models learned with supervision.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Bilateral Dependency Neural Networks for Cross-Language Algorithm Classification

+

Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang. SANER 2018

+

+ + [TR] + + + +
+ + representation + +

+

Algorithm classification is to automatically identify +the classes of a program based on the algorithm(s) and/or data +structure(s) implemented in the program. It can be useful for +various tasks, such as code reuse, code theft detection, and malware detection. Code similarity metrics, on the basis of features +extracted from syntax and semantics, have been used to classify +programs. Such features, however, often need manual selection +effort and are specific to individual programming languages, +limiting the classifiers to programs in the same language. +To recognize the similarities and differences among algorithms +implemented in different languages, this paper describes a +framework of Bilateral Neural Networks (Bi-NN) that builds a +neural network on top of two underlying sub-networks, each of +which encodes syntax and semantics of code in one language. A +whole Bi-NN can be trained with bilateral programs that implement the same algorithms and/or data structures in different +languages and then be applied to recognize algorithm classes +across languages.

+ +

We have instantiated the framework with several kinds of +token-, tree- and graph-based neural networks that encode and +learn various kinds of information in code. We have applied +the instances of the framework to a code corpus collected from +GitHub containing thousands of Java and C++ programs imple- +menting 50 different algorithms and data structures. Our evalua- +tion results show that the use of Bi-NN indeed produces promising +algorithm classification results both within one language and +across languages, and the encoding of dependencies from code +into the underlying neural networks helps improve algorithm +classification accuracy further. In particular, our custom-built +dependency trees with tree-based convolutional neural networks +achieve the highest classification accuracy among the different +instances of the framework that we have evaluated. Our study +points to a possible future research direction to tailor bilateral +and multilateral neural networks that encode more relevant +semantics for code learning, mining and analysis tasks

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks

+

Nghi D. Q. Bui, Lingxiao Jiang, Yijun Yu. NLSE 2018

+

+ + [ArXiV] + + + +
+ + representation + + grammar + +

+

Towards the vision of translating code that implements an algorithm from one programming language into another, this +paper proposes an approach for automated program classification using +bilateral tree-based convolutional neural networks +(BiTBCNNs). It is layered on top of two tree-based +convolutional neural networks (TBCNNs), each of which recognizes the algorithm of code written in an individual programming language. The combination layer of the networks +recognizes the similarities and differences among code in different programming languages. The BiTBCNNs are trained +using the source code in different languages but known to +implement the same algorithms and/or functionalities. For +a preliminary evaluation, we use 3591 Java and 3534 C++ +code snippets from 6 algorithms we crawled systematically +from GitHub. We obtained over 90% accuracy in the cross-language binary classification task to tell whether any given +two code snippets implement a same algorithm. Also, for the +algorithm classification task, i.e., to predict which one of the +six algorithm labels is implemented by an arbitrary C++ code +snippet, we achieved over 80% precision.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code

+

Nghi D. Q. Bui, Lingxiao Jiang. ICSE 2018

+

+ + [PDF] + + [code] + + + +
+ + representation + +

+

Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may be either specific to the language grammars, or specific to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach to automatically learn cross-language representations for various kinds of structural code elements that may be used for program translation. Our key idea is two folded: First, we normalize and enrich code token streams with additional structural and semantic information, and train cross-language vector representations for the tokens (a.k.a. shared embeddings based on word2vec, a neural-network-based technique for producing word embeddings; Second, hierarchically from bottom up, we construct shared embeddings for code elements of higher levels of granularity (e.g., expressions, statements, methods) from the embeddings for their constituents, and then build mappings among code elements across languages based on similarities among embeddings. +Our preliminary evaluations on about 40,000 Java and C# source files from 9 software projects show that our approach can automatically learn shared embeddings for various code elements in different languages and identify their cross-language mappings with reasonable Mean Average Precision scores. When compared with an existing tool for mapping library API methods, our approach identifies many more mappings accurately. The mapping results and code can be accessed at this https URL. We believe that our idea for learning cross-language vector representations with code structural information can be a useful step towards automated program translation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SAR: Learning Cross-Language API Mappings with Little Knowledge

+

N. D. Q. Bui, Y. Yu, L. Jiang. FSE 2019

+

+ + [PDF] + + [code] + + + +
+ + representation + + API + +

+

To save manual effort, developers often translate programs from one programming language to another, instead of implementing it from scratch. Translating application program interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying the API mappings across programming languages. However, all these approaches still require large amount of manual effort in preparing parallel program corpora, ranging from pairs of APIs, to manually identified code in different languages that are considered as functionally equivalent. To minimize the manual effort in identifying parallel program corpora and API mappings, this paper aims at an automated approach to map APIs across languages with much less knowledge a priori needed than other existing approaches. The approach is based on an realization of the notion of domain adaption combined with code embedding, which can better align two vector spaces: taking as input large sets of programs, our approach first generates numeric vector representations of the programs, especially the APIs used in each language, and it adapts generative adversarial networks (GAN) to align the vectors from the spaces of two languages. For a better alignment, we initialize the GAN with parameters derived from optional API mapping seeds that can be identified accurately with a simple automatic signature-based matching heuristic. Then the cross-language API mappings can be identified via nearest-neighbors queries in the aligned vector spaces.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations

+

Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang. SIGIR 2021

+

+ + [ArXiV] + + + +
+ + pretraining + + search + +

+

We propose Corder, a self-supervised contrastive learning framework for source code model. Corder is designed to alleviate the need of labeled data for code retrieval and code summarization tasks. The pre-trained model of Corder can be used in two ways: (1) it can produce vector representation of code which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in a fine-tuning process for tasks that might still require label data such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set of semantic-preserving transformation operators to generate code snippets that are syntactically diverse but semantically equivalent. Through extensive experiments, we have shown that the code models pretrained by Corder substantially outperform the other baselines for code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees

+

Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang. ICSE 2021

+

+ + [ArXiV] + + + +
+ + representation + +

+

Building deep learning models on source code has found many successful software engineering applications, such as code search, code comment generation, bug detection, code migration, and so on. Current learning techniques, however, have a major drawback that these models are mostly trained on datasets labeled for particular downstream tasks, and code representations may not be suitable for other tasks. While some techniques produce representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. Although certain techniques generate representations from unlabeled code when applied to downstream tasks they are far from satisfactory. This paper proposes InferCode to overcome the limitation by adapting the self-supervised learning mechanism to build source code model. The key novelty lies in training code representations by predicting automatically identified subtrees from the context of the ASTs. Subtrees in ASTs are treated with InferCode as the labels for training code representations without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream tasks or code units. We trained an InferCode model instance using the Tree-based CNN as the encoder of a large set of Java code and applied it to downstream unsupervised tasks such as code clustering, code clone detection, cross-language code search or reused under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, ASTNN, higher performance results are achieved using our pre-trained InferCode model with a significant margin for most tasks including those involving different programming languages.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

TAG : Type Auxiliary Guiding for Code Comment Generation

+

Ruichu Cai, Zhihao Liang, Boyan Xu, Zijian Li, Yuexing Hao, Yao Chen. ACL 2020

+

+ + [ArXiV] + + + +
+ + bimodal + + documentation + +

+

Existing leading code comment generation approaches with the structure-to-sequence framework ignores the type information of the interpretation of the code, e.g., operator, string, etc. However, introducing the type information into the existing framework is non-trivial due to the hierarchical dependence among the type information. In order to address the issues above, we propose a Type Auxiliary Guiding encoder-decoder framework for the code comment generation task which considers the source code as an N-ary tree with type information associated with each node. Specifically, our framework is featured with a Type-associated Encoder and a Type-restricted Decoder which enables adaptive summarization of the source code. We further propose a hierarchical reinforcement learning method to resolve the training difficulties of our proposed framework. Extensive evaluations demonstrate the state-of-the-art performance of our framework with both the auto-evaluated metrics and case studies.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

When Deep Learning Met Code Search

+

Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, Satish Chandra. 2019

+

+ + [ArXiV] + + + +
+ + search + +

+

There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language queries, into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including unsupervised techniques, which rely only on a corpus of code examples, and supervised techniques, which use an aligned corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet.

+ +

Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform, training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a minimal supervision extension to an existing unsupervised technique.

+ +

Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective that more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Syntax Errors Just Aren’t Natural: Improving Error Reporting with Language Models

+

Joshua Charles Campbell, Abram Hindle, José Nelson Amaral. MSR 2014

+

+ + + +
+ + repair + + language model + +

+

A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in +many errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance on mutated source code from several projects. Our tool, trained on the past versions of a project, can effectively augment the syntax error locations produced by the native compiler. Thus we provide a methodology and tool that exploits the naturalness of software source code to detect syntax errors alongside the parser.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks

+

Beatrice Casey, Joanna C. S. Santos, George Perry. 2024

+

+ + [ArXiV] + + + +
+ + survey + + cybersecurity + + vulnerability + +

+

Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what’s not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

+

Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Jacob Ginesin, Edward Berman, George Chakhnashvili, Anton Lozhkov, Carolyn Jane Anderson, Arjun Guha. 2023

+

+ + [ArXiV] + + + +
+ + editing + +

+

A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Hidden Markov Model to Detect Coded Information Islands in Free Text

+

Luigi Cerulo, Michele Ceccarelli, Massimiliano Di Penta, Gerardo Canfora. SCAM 2013

+

+ + + +
+ + information extraction + +

+

Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such a content is challenging, due to the mix-up of +source code and natural language, unstructured text.

+ +

In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train a HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens — e.g., words, language keywords, numbers, parentheses, punctuation marks, etc. — observed in a text switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could be easily extended to include in principle any text-interleaved language.

+ +

We evaluated our approach with respect to the state of art on a set of development emails and bug reports drawn from the software repositories of well known open source systems. Results indicate an accuracy between 82% and 99%, which is in line with existing approaches which, differently from ours, require the manual definition of regular expressions or parsers.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Irish: A Hidden Markov Model to detect coded information islands in free text

+

Luigi Cerulo, Michele Ceccarelli, Massimiliano Di Penta, Gerardo Canfora. Science of Computer Programming 2015

+

+ + + +
+ + information extraction + +

+

Developers’ communication, as contained in emails, issue trackers, and forums, is a precious source of information to support the development process. For example, it can +be used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers’ communication can be useful to support +several software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging, due to the unstructured nature of free text, which mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language parts.

+ +

We conduct an extensive evaluation of Irish (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, with respect to the state of art approaches based on island parsing or island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure of Irish between 74% and 99%; this is in line with existing approaches which, differently from Irish, require specific expertise for the definition of regular expressions or grammars.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Automatically generating features for learning program analysis heuristics

+

Kwonsoo Chae, Hakjoo Oh, Kihong Heo, Hongseok Yang. 2016

+

+ + [ArXiV] + + + +
+ + representation + +

+

We present a technique for automatically generating features for data-driven program analyses. Recently data-driven approaches for building a program analysis have been proposed, which mine existing codebases and automatically learn heuristics for finding a cost-effective abstraction for a given analysis task. Such approaches reduce the burden of the analysis designers, but they do not remove it completely; they still leave the highly nontrivial task of designing so called features to the hands of the designers. Our technique automates this feature design process. The idea is to use programs as features after reducing and abstracting them. Our technique goes through selected program-query pairs in codebases, and it reduces and abstracts the program in each pair to a few lines of code, while ensuring that the analysis behaves similarly for the original and the new programs with respect to the query. Each reduced program serves as a boolean feature for program-query pairs. This feature evaluates to true for a given program-query pair when (as a program) it is included in the program part of the pair. We have implemented our approach for three real-world program analyses. Our experimental evaluation shows that these analyses with automatically-generated features perform comparably to those with manually crafted features.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CODIT: Code Editing with Tree-Based Neural Machine Translation

+

Saikat Chakraborty, Miltiadis Allamanis, Baishakhi Ray. 2018

+

+ + [ArXiV] + + + +
+ + grammar + + grammar + + repair + + code generation + +

+

The way developers edit day-to-day code tends to be repetitive, often using existing code elements. Many researchers have tried to automate repetitive code changes by learning from specific change templates which are applied to limited scope. The advancement of Neural Machine Translation (NMT) and the availability of vast open-source evolutionary data opens up the possibility of automatically learning those templates from the wild. However, unlike natural languages, for which NMT techniques were originally devised, source code and its changes have certain properties. For instance, compared to natural language, source code vocabulary can be significantly larger. Further, good changes in code do not break its syntactic structure. Thus, deploying state-of-the-art NMT models without adapting the methods to the source code domain yields sub-optimal results. To this end, we propose a novel Tree based NMT system to model source code changes and learn code change patterns from the wild. We realize our model with a change suggestion engine: CODIT and train the model with more than 30k real-world changes and evaluate it on 6k patches. Our evaluation shows the effectiveness of CODIT in learning and suggesting patches.CODIT also shows promise generating bug fix patches.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Learning based Vulnerability Detection: Are We There Yet?

+

Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, Baishakhi Ray. TSE 2021

+

+ + [ArXiV] + + + +
+ + defect + + survey + +

+

Automated detection of software vulnerabilities is a fundamental problem in software security. Existing program analysis techniques either suffer from high false positives or false negatives. Recent progress in Deep Learning (DL) has resulted in a surge of interest in applying DL for automated vulnerability detection. Several recent studies have demonstrated promising results achieving an accuracy of up to 95% at detecting vulnerabilities. In this paper, we ask, “how well do the state-of-the-art DL-based techniques perform in a real-world vulnerability prediction scenario?”. To our surprise, we find that their performance drops by more than 50%. A systematic investigation of what causes such precipitous performance drop reveals that existing DL-based vulnerability prediction approaches suffer from challenges with the training data (e.g., data duplication, unrealistic distribution of vulnerable classes, etc.) and with the model choices (e.g., simple token-based models). As a result, these approaches often do not learn features related to the actual cause of the vulnerabilities. Instead, they learn unrelated artifacts from the dataset (e.g., specific variable/function names, etc.). Leveraging these empirical findings, we demonstrate how a more principled approach to data collection and model design, based on realistic settings of vulnerability prediction, can lead to better solutions. The resulting tools perform significantly better than the studied baseline: up to 33.57% boost in precision and 128.38% boost in recall compared to the best performing model in the literature. Overall, this paper elucidates existing DL-based vulnerability prediction systems’ potential issues and draws a roadmap for future DL-based vulnerability prediction research. In that spirit, we make available all the artifacts supporting our results: https://git.io/Jf6IA

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On Multi-Modal Learning of Editing Source Code

+

Saikat Chakraborty, Baishakhi Ray. 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + edit + +

+

In recent years, Neural Machine Translator (NMT) has shown promise in automatically editing source code. Typical NMT based code editor only considers the code that needs to be changed as input and suggests developers with a ranked list of patched code to choose from - where the correct one may not always be at the top of the list. While NMT based code editing systems generate a broad spectrum of plausible patches, the correct one depends on the developers’ requirement and often on the context where the patch is applied. Thus, if developers provide some hints, using natural language, or providing patch context, NMT models can benefit from them. As a proof of concept, in this research, we leverage three modalities of information: edit location, edit code context, commit messages (as a proxy of developers’ hint in natural language) to automatically generate edits with NMT models. To that end, we build MODIT, a multi-modal NMT based code editing engine. With in-depth investigation and analysis, we show that developers’ hint as an input modality can narrow the search space for patches and outperform state-of-the-art models to generate correctly patched code in top-1 position.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Capturing source code semantics via tree-based convolution over API-enhanced AST

+

Long Chen, Wei Ye, Shikun Zhang. Computing Frontiers 2019

+

+ + + +
+ + grammar + + representation + +

+

When deep learning meets big code, a key question is how to efficiently learn a distributed representation for source code that can capture its semantics effectively. We propose to use tree-based convolution over API-enhanced AST. To demonstrate the effectiveness of our approach, we apply it to detect semantic clones—code fragments with similar semantics but dissimilar syntax. Experiment results show that our approach outperforms an existing state-of-the-art approach that uses tree-based LSTM, with an increase of 0.39 and 0.12 in F1-score on OJClone and BigCloneBench respectively. We further propose architectures that incorporate our approach for code search and code summarization.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Literature Study of Embeddings on Source Code

+

Zimin Chen, Martin Monperrus. 2019

+

+ + [ArXiV] + + + +
+ + representation + +

+

Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and discuss the usage of word embedding techniques on programs and source code. The articles in this survey have been collected by asking authors of related work and with an extensive search on Google Scholar. Each article is categorized into five categories: 1. embedding of tokens 2. embedding of functions or methods 3. embedding of sequences or sets of method calls 4. embedding of binary code 5. other embeddings. We also provide links to experimental data and show some remarkable visualization of code embeddings. In summary, word embedding has been successfully applied on different granularities of source code. With access to countless open-source repositories, we see a great potential of applying other data-driven natural language processing techniques on source code in the future.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Mining Likely Analogical APIs across Third-Party Libraries via Large-Scale Unsupervised API Semantics Embedding

+

Chunyang Chen, Zhenchang Xing, Yang Liu, Kent Ong Long Xiong. TSE 2019

+

+ + + +
+ + API + + representation + +

+

Establishing API mappings between third-party libraries is a prerequisite step for library migration tasks. Manually establishing API mappings is tedious due to the large number of APIs to be examined. Having an automatic technique to create a database of likely API mappings can significantly ease the task. Unfortunately, existing techniques either adopt supervised learning mechanism that requires already-ported or functionality similar applications across major programming languages or platforms, which are difficult to come by for an arbitrary pair of third-party libraries, or cannot deal with lexical gap in the API descriptions of different libraries. To overcome these limitations, we present an unsupervised deep learning based approach to embed both API usage semantics and API description (name and document) semantics into vector space for inferring likely analogical API mappings between libraries. Based on deep learning models trained using tens of millions of API call sequences, method names and comments of 2.8 millions of methods from 135,127 GitHub projects, our approach significantly outperforms other deep learning or traditional information retrieval (IR) methods for inferring likely analogical APIs. We implement a proof-of-concept website which can recommend analogical APIs for 583,501 APIs of 111 pairs of analogical Java libraries with diverse functionalities. This scale of third-party analogical-API database has never been achieved before.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

+

Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, Martin Monperrus. 2019

+

+ + [ArXiV] + + + +
+ + repair + + code generation + +

+

This paper presents a novel end-to-end approach to program repair based on sequence-to-sequence learning. We devise, implement, and evaluate a system, called SequenceR, for fixing bugs based on sequence-to-sequence learning on source code. This approach uses the copy mechanism to overcome the unlimited vocabulary problem that occurs with big code. Our system is data-driven; we train it on 35,578 commits, carefully curated from open-source repositories. We evaluate it on 4,711 independent real bug fixes, as well on the Defects4J benchmark used in program repair research. SequenceR is able to perfectly predict the fixed line for 950/4711 testing samples. It captures a wide range of repair operators without any domain-specific top-down design.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Evaluating Large Language Models Trained on Code

+

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, Will Guss, Alex Nichol, Igor Babuschkin, Suchir Balaji, Shantanu Jain, Andrew Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba. 2021

+

+ + [ArXiV] + + [Dataset] + + + +
+ + language model + + synthesis + +

+

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair

+

Zimin Chen, Vincent J Hellendoorn, Pascal Lamblin, Petros Maniatis, Pierre-Antoine Manzagol, Daniel Tarlow, Subhodeep Moitra. NeurIPS 2021

+

+ + [NeurIPS Proceedings] + + + +
+ + repair + +

+

Machine learning for understanding and editing source code has recently attracted significant interest, with many developments in new models, new code representations, and new tasks.This proliferation can appear disparate and disconnected, making each approach seemingly unique and incompatible, thus obscuring the core machine learning challenges and contributions.In this work, we demonstrate that the landscape can be significantly simplified by taking a general approach of mapping a graph to a sequence of tokens and pointers.Our main result is to show that 16 recently published tasks of different shapes can be cast in this form, based on which a single model architecture achieves near or above state-of-the-art results on nearly all tasks, outperforming custom models like code2seq and alternative generic models like Transformers.This unification further enables multi-task learning and a series of cross-cutting experiments about the importance of different modeling choices for code understanding and repair tasks.The full framework, called PLUR, is easily extensible to more tasks, and will be open-sourced (https://github.com/google-research/plur).

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeT: Code Generation with Generated Tests

+

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen. 2022

+

+ + [ArXiV] + + + +
+ + synthesis + + Transformer + + execution + +

+

Given a programming problem, pre-trained language models such as Codex have demonstrated the ability to generate multiple different code solutions via sampling. However, selecting a correct or best solution from those samples still remains a challenge. While an easy way to verify the correctness of a code solution is through executing test cases, producing high-quality test cases is prohibitively expensive. In this paper, we explore the use of pre-trained language models to automatically generate test cases, calling our method CodeT: Code generation with generated Tests. CodeT executes the code solutions using the generated test cases, and then chooses the best solution based on a dual execution agreement with both the generated test cases and other generated solutions. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks. Extensive experimental results demonstrate CodeT can achieve significant, consistent, and surprising improvements over previous methods. For example, CodeT improves the pass@1 on HumanEval to 65.8%, an increase of absolute 18.8% on the code-davinci-002 model, and an absolute 20+% improvement over previous state-of-the-art results.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Reverse DNNs from AI Programs Automatically

+

Simin Chen, Hamed Khanpour, Cong Liu, Wei Yang. IJCAI-ECAI 2022 2022

+

+ + [ArXiV] + + + +
+ + Reverse Engineering + + Binary Code + +

+

With the privatization deployment of DNNs on edge devices, the security of on-device DNNs has raised significant concern. To quantify the model leakage risk of on-device DNNs automatically, we propose NNReverse, the first learning-based method which can reverse DNNs from AI programs without domain knowledge. NNReverse trains a representation model to represent the semantics of binary code for DNN layers. By searching the most similar function in our database, NNReverse infers the layer type of a given function’s binary code. To represent assembly instructions semantics precisely, NNReverse proposes a more finegrained embedding model to represent the textual and structural-semantic of assembly functions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

+

Yizheng Chen, Zhoujie Ding, Xinyun Chen, David Wagner. 2023

+

+ + [ArXiV] + + + +
+ + dataset + + Transformer + + vulnerability + +

+

We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits. Our dataset covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection. +Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. +However, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future for vulnerability detection, outperforming Graph Neural Networks (GNNs) with manual feature engineering. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Supersonic: Learning to Generate Source Code Optimizations in C/C++

+

Zimin Chen, Sen Fang, Martin Monperrus. 2023

+

+ + [ArXiV] + + + +
+ + optimization + +

+

Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic’s performance is benchmarked against OpenAI’s GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models

+

Simin Chen, Xiaoning Feng, Xiaohong Han, Cong Liu, Wei Yang. FSE 2024 2024

+

+ + [ArXiV] + + [Code] + + + +
+ + benchmarking + + evaluation + +

+

In recent times, a plethora of Large Code Generation Models (LCGMs) have been proposed, showcasing significant potential in assisting developers with complex programming tasks. Benchmarking LCGMs necessitates the creation of a set of diverse programming problems, and each problem comprises the prompt (including the task description), canonical solution, and test inputs. The existing methods for constructing such a problem set can be categorized into two main types: manual methods and perturbation-based methods. However, manual methods demand high effort and lack scalability, while also risking data integrity due to LCGMs’ potentially contaminated data collection, and perturbation-based approaches mainly generate semantically homogeneous problems with the same canonical solutions and introduce typos that can be easily auto-corrected by IDE, making them ineffective and unrealistic. In this work, we propose the idea of programming problem merging (PPM) and provide two implementation of this idea, we utilize our tool on two widely-used datasets and compare it against nine baseline methods using eight code generation models. The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems, comparing to the baselines.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Scalable Taint Specification Inference with Big Code

+

V. Chibotaru, B. Bichsel, Veselin Raychev, Martin Vechev. PLDI 2019

+

+ + + +
+ + defect + + program analysis + +

+

We present a new scalable, semi-supervised method for inferring +taint analysis specifications by learning from a large dataset of programs. +Taint specifications capture the role of library APIs (source, sink, sanitizer) +and are a critical ingredient of any taint analyzer that aims to detect +security violations based on information flow.

+ +

The core idea of our method +is to formulate the taint specification learning problem as a linear +optimization task over a large set of information flow constraints. +The resulting constraint system can then be efficiently solved with +state-of-the-art solvers. Thanks to its scalability, our method can infer +many new and interesting taint specifications by simultaneously learning from +a large dataset of programs (e.g., as found on GitHub), while requiring +few manual annotations.

+ +

We implemented our method in an end-to-end system, +called Seldon, targeting Python, a language where static specification +inference is particularly hard due to lack of typing information. +We show that Seldon is practically effective: it learned almost 7,000 API +roles from over 210,000 candidate APIs with very little supervision +(less than 300 annotations) and with high estimated precision (67%). +Further,using the learned specifications, our taint analyzer flagged more than +20,000 violations in open source projects, 97% of which were +undetectable without the inferred specifications.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Empirical Study of Transformers for Source Code

+

Nadezhda Chirkova, Sergey Troshin. 2020

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i. e. follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and all consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On the Embeddings of Variables in Recurrent Neural Networks for Source Code

+

Nadezhda Chirkova. NAACL 2021

+

+ + [ArXiV] + + [Code] + + + +
+ + autocomplete + +

+

Source code processing heavily relies on the methods widely used in natural language processing (NLP), but involves specifics that need to be taken into account to achieve higher quality. An example of this specificity is that the semantics of a variable is defined not only by its name but also by the contexts in which the variable occurs. In this work, we develop dynamic embeddings, a recurrent mechanism that adjusts the learned semantics of the variable when it obtains more information about the variable’s role in the program. We show that using the proposed dynamic embeddings significantly improves the performance of the recurrent neural network, in code completion and bug fixing tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Beware of the Unexpected: Bimodal Taint Analysis

+

Yiu Wai Chow, Max Schäfer, Michael Pradel. ISSTA 2023

+

+ + [ArXiV] + + + +
+ + static analysis + +

+

Static analysis is a powerful tool for detecting security vulnerabilities and other programming problems. Global taint tracking, in particular, can spot vulnerabilities arising from complicated data flow across multiple functions. However, precisely identifying which flows are problematic is challenging, and sometimes depends on factors beyond the reach of pure program analysis, such as conventions and informal knowledge. For example, learning that a parameter name of an API function locale ends up in a file path is surprising and potentially problematic. In contrast, it would be completely unsurprising to find that a parameter command passed to an API function execaCommand is eventually interpreted as part of an operating-system command. This paper presents Fluffy, a bimodal taint analysis that combines static analysis, which reasons about data flow, with machine learning, which probabilistically determines which flows are potentially problematic. The key idea is to let machine learning models predict from natural language information involved in a taint flow, such as API names, whether the flow is expected or unexpected, and to inform developers only about the latter. We present a general framework and instantiate it with four learned models, which offer different trade-offs between the need to annotate training data and the accuracy of predictions. We implement Fluffy on top of the CodeQL analysis framework and apply it to 250K JavaScript projects. Evaluating on five common vulnerability types, we find that Fluffy achieves an F1 score of 0.85 or more on four of them across a variety of datasets.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Suggesting Comment Completions for Python using Neural Language Models

+

Adelina Ciurumelea; Sebastian Proksch; Harald C. Gall. SANER 2020

+

+ + [IEEE Xplore] + + + +
+ + bimodal + + autocomplete + + documentation + +

+

Source-code comments are an important communication medium between developers to better understand and maintain software. Current research focuses on auto-generating comments by summarizing the code. However, good comments contain additional details, like important design decisions or required trade-offs, and only developers can decide on the proper comment content. Automated summarization techniques cannot include information that does not exist in the code, therefore fully-automated approaches while helpful, will be of limited use. In our work, we propose to empower developers through a semi-automated system instead. We investigate the feasibility of using neural language models trained on a large corpus of Python documentation strings to generate completion suggestions and obtain promising results. By focusing on confident predictions, we can obtain a top-3 accuracy of over 70%, although this comes at the cost of lower suggestion frequency. Our models can be improved by leveraging context information like the signature and the full body of the method. Additionally, we are able to return good accuracy completions even for new projects, suggesting the generalizability of our approach.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

PyMT5: multi-mode translation of natural language and Python code with transformers

+

Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan. EMNLP 2020

+

+ + [ArXiV] + + + +
+ + bimodal + + code generation + + summarization + + documentation + + language model + + pretraining + +

+

Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Distilling Transformers for Neural Cross-Domain Search

+

Colin B. Clement, Chen Wu, Dawn Drain, Neel Sundaresan. 2021

+

+ + [ArXiV] + + + +
+ + search + + Transformer + +

+

Pre-trained transformers have recently clinched top spots in the gamut of natural language tasks and pioneered solutions to software engineering tasks. Even information retrieval has not been immune to the charm of the transformer, though their large size and cost is generally a barrier to deployment. While there has been much work in streamlining, caching, and modifying transformer architectures for production, here we explore a new direction: distilling a large pre-trained translation model into a lightweight bi-encoder which can be efficiently cached and queried. We argue from a probabilistic perspective that sequence-to-sequence models are a conceptually ideal—albeit highly impractical—retriever. We derive a new distillation objective, implementing it as a data augmentation scheme. Using natural language source code search as a case study for cross-domain search, we demonstrate the validity of this idea by significantly improving upon the current leader of the CodeSearchNet challenge, a recent natural language code search benchmark.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy

+

Colin B. Clement, Shuai Lu, Xiaoyu Liu, Michele Tufano, Dawn Drain, Nan Duan, Neel Sundaresan, Alexey Svyatkovskiy. 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + language model + + code generation + +

+

Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they will be unable to leverage the entire relevant context of large files and packages for any given task. While there are many efforts to extend the context window, we introduce an architecture-independent approach for leveraging the syntactic hierarchies of source code for incorporating entire file-level context into a fixed-length window. Using concrete syntax trees of each source file we extract syntactic hierarchies and integrate them into context window by selectively removing from view more specific, less relevant scopes for a given task. We evaluate this approach on code generation tasks and joint translation of natural language and source code in Python programming language, achieving a new state-of-the-art in code completion and summarization for Python in the CodeXGLUE benchmark. We also introduce new CodeXGLUE benchmarks for user-experience-motivated tasks: code completion with normalized literals, method body completion/code summarization conditioned on file-level context.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Commit2Vec: Learning Distributed Representations of Code Changes

+

Adelina Ciurumelea; Sebastian Proksch; Harald C. Gall. 2019

+

+ + [ArXiV] + + + +
+ + edit + +

+

Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories).

+ +

In this work, we elaborate upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and we adapt it to represent source changes (i.e., commits). We use this representation to classify security-relevant commits.

+ +

Because our method uses transfer learning (that is, we train a network on a “pretext task” for which abundant labeled data is available, and then we use such network for the target task of commit classification, for which fewer labeled instances are available), we studied the impact of pre-training the network using two different pretext tasks versus a randomly initialized model.

+ +

Our results indicate that representations that leverage the structural information obtained through code syntax outperform token-based representations. Furthermore, the performance metrics obtained when pre-training on a loosely related pretext task with a very large dataset (>10e6 samples) were surpassed when pretraining on a smaller dataset (>10e4 samples) but for a pretext task that is more closely related to the target task.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Embedding Java Classes with code2vec: Improvements from Variable Obfuscation

+

Rhys Compton, Eibe Frank, Panos Patros, Abigail Koay. MSR 2020

+

+ + [ArXiV] + + + +
+ + naming + + adversarial + +

+

Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaining the semantics of the code as much as possible. code2vec is a recently released embedding approach that uses the proxy task of method name prediction to map Java methods to feature vectors. However, experimentation with code2vec shows that it learns to rely on variable names for prediction, causing it to be easily fooled by typos or adversarial attacks. Moreover, it is only able to embed individual Java methods and cannot embed an entire collection of methods such as those present in a typical Java class, making it difficult to perform predictions at the class level (e.g., for the identification of malicious Java classes). Both shortcomings are addressed in the research presented in this paper. We investigate the effect of obfuscating variable names during the training of a code2vec model to force it to rely on the structure of the code rather than specific names and consider a simple approach to creating class-level embeddings by aggregating sets of method embeddings. Our results, obtained on a challenging new collection of source-code classification problems, indicate that obfuscating variable names produces an embedding model that is both impervious to variable naming and more accurately reflects code semantics. The datasets, models, and code are shared for further ML research on source code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Exploring the Use of Deep Learning for Feature Location

+

Christopher S. Corley, Kostadin Damevski, Nicholas A. Kraft. 2015

+

+ + + +
+ + feature location + + representation + +

+

Deep learning models are a class of neural networks. Relative to n-gram models, deep learning models can capture more complex statistical patterns based on smaller training corpora. In this paper we explore the use of a particular deep learning model, document vectors (DVs), for feature location. DVs seem well suited to use with source code, because they both capture the influence of context on each term in a corpus and map terms into a continuous semantic space that encodes semantic relationships such as synonymy. We present preliminary results that show that a feature location technique (FLT) based on DVs can outperform an analogous FLT based on latent Dirichlet allocation (LDA) and then suggest several directions for future work on the use of deep learning models to improve developer effectiveness in feature location.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

End-to-end Deep Learning of Optimization Heuristics

+

Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather. 2017

+

+ + + +
+ + optimization + +

+

Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand crafted by developers through a combination of expert domain knowledge and trial and error. This makes the quality of the final model directly dependent on the skill and available time of the system architect.

+ +

Our work introduces a better way for building heuristics. We develop a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation. Further, we show that our neural nets can transfer learning from one optimization problem to another, improving the accuracy of new models, without the help of human experts.

+ +

We compare the effectiveness of our automatically generated heuristics against ones with features hand-picked by experts. We examine two challenging tasks: predicting optimal mapping for heterogeneous parallelism and GPU thread coarsening factors. In 89% of the cases, the quality of our fully automatic heuristics matches or surpasses that of state-of-the-art predictive models using hand-crafted features, providing on average 14% and 12% more performance with no human effort expended on designing features.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Synthesizing benchmarks for predictive modeling

+

Chris Cummins, Pavlos Petoumenos, Zheng Wang, Hugh Leather. CGO 2017

+

+ + + +
+ + optimization + + code generation + +

+

Predictive modeling using machine learning is an effective method for building compiler heuristics, but there is a shortage of benchmarks. Typical machine learning experiments outside of the compilation field train over thousands or millions of examples. In machine learning for compilers, however, there are typically only a few dozen common benchmarks available. This limits the quality of learned models, as they have very sparse training data for what are often high-dimensional feature spaces. What is needed is a way to generate an unbounded number of training programs that finely cover the feature space. At the same time the generated programs must be similar to the types of programs that human developers actually write, otherwise the learning will target the wrong parts of the feature space. We mine open source repositories for program fragments and apply deep learning techniques to automatically construct models for how humans write programs. We sample these models to generate an unbounded number of runnable training programs. The quality of the programs is such that even human developers struggle to distinguish our generated programs from hand-written code. We use our generator for OpenCL programs, CLgen, to automatically synthesize thousands of programs and show that learning over these improves the performance of a state of the art predictive model by 1.27x. In addition, the fine covering of the feature space automatically exposes weaknesses in the feature design which are invisible with the sparse training examples from existing benchmark suites. Correcting these weaknesses further increases performance by 4.30x.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Compiler Fuzzing through Deep Learning

+

Chris Cummins, Pavlos Petoumenos, Alastair Murray, Hugh Leather. ISSTA 2018

+

+ + + +
+ + fuzzing + + code generation + +

+

Random program generation — fuzzing — is an effective technique +for discovering bugs in compilers but successful fuzzers require +extensive development effort for every language supported by the +compiler, and often leave parts of the language space untested.

+ +

We introduce DeepSmith, a novel machine learning approach +to accelerating compiler validation through the inference of generative models for compiler inputs. Our approach +infers a learned +model of the structure of real world code based on a large corpus of open source code. Then, it uses the model to automatically +generate tens of thousands of realistic programs. Finally, we apply +established differential testing methodologies on them to expose +bugs in compilers. We apply our approach to the OpenCL programming language, automatically exposing bugs with little effort on our +side. In 1,000 hours of automated testing of commercial and open +source compilers, we discover bugs in all of them, submitting 67 +bug reports. Our test cases are on average two orders of magnitude +smaller than the state-of-the-art, require 3.03× less time to generate +and evaluate, and expose bugs which the state-of-the-art cannot. +Our random program generator, comprising only 500 lines of code, +took 12 hours to train for OpenCL versus the state-of-the-art taking +9 man months to port from a generator for C and 50,000 lines of +code. With 18 lines of code we extended our program generator to +a second language, uncovering crashes in Solidity compilers in 12 +hours of automated testing.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

ProGraML: Graph-based Deep Learning for Program Optimization and Analysis

+

Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Hugh Leather. 2020

+

+ + [ArXiV] + + [Dataset] + + [Code] + + + +
+ + dataset + + GNN + +

+

The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in two key areas: a representation that accurately captures the semantics of programs, and a model architecture with sufficient expressiveness to reason about this representation.

+ +

We introduce ProGraML - Program Graphs for Machine Learning - a novel graph-based program representation using a low level, language agnostic, and portable format; and machine learning models capable of performing complex downstream tasks over these graphs. The ProGraML representation is a directed attributed multigraph that captures control, data, and call relations, and summarizes instruction and operand types and ordering. Message Passing Neural Networks propagate information through this structured representation, enabling whole-program or per-vertex classification tasks.

+ +

ProGraML provides a general-purpose program representation that equips learnable models to perform the types of program analysis that are fundamental to optimization. To this end, we evaluate the performance of our approach first on a suite of traditional compiler analysis tasks: control flow reachability, dominator trees, data dependencies, variable liveness, and common subexpression detection. On a benchmark dataset of 250k LLVM-IR files covering six source programming languages, ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches. We then apply our approach to two high-level tasks - heterogeneous device mapping and program classification - setting new state-of-the-art performance in both.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Open Vocabulary Learning on Source Code with a Graph-Structured Cache

+

Milan Cvitkovic, Badal Singh, Anima Anandkumar. 2018

+

+ + [ArXiV] + + + +
+ + GNN + + variable misuse + + defect + + representation + +

+

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models’ performance on a code completion task and a variable naming task — with over 100% relative improvement on the latter — at the cost of a moderate increase in computation time.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A deep language model for software code

+

Hoa Khanh Dam, Truyen Tran, Trang Pham. 2016

+

+ + [ArXiV] + + + +
+ + language model + + code generation + +

+

Existing language models such as n-grams for software code often fail to capture a long context where dependent code elements scatter far apart. In this paper, we propose a novel approach to build a language model for software code to address this particular issue. Our language model, partly inspired by human memory, is built upon the powerful deep learning-based Long Short Term Memory architecture that is capable of learning long-term dependencies which occur frequently in software code. Results from our intrinsic evaluation on a corpus of Java projects have demonstrated the effectiveness of our language model. This work contributes to realizing our vision for DeepSoft, an end-to-end, generic deep learning-based framework for modeling software and its development process.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

RefiNym: Using Names to Refine Types

+

Santanu Dash, Miltiadis Allamanis, Earl T. Barr. FSE 2018

+

+ + + +
+ + program analysis + + types + +

+

Source code is bimodal: it combines a formal algorithmic channel and a natural language channel of identifiers and comments. In this work, we model the bimodality of code with name lows, an assignment low graph augmented to track identiier names. Conceptual types are logically distinct types that do not always coincide with program types. Passwords and URLs are example conceptual types that can share the program type string. Our tool, RefiNym, is an unsupervised method that mines a lattice of conceptual types from name lows and reiies them into distinct nominal types. For string, RefiNym inds and splits conceptual types originally merged into a single type, reducing the number of same-type variables per scope from 8.7 to 2.2 while eliminating 21.9% of scopes that have more than one same-type variable in scope. This makes the code more self-documenting and frees the type system to prevent a developer from inadvertently assigning data across conceptual types.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Reverse Engineering of Stripped Binaries

+

Yaniv David, Uri Alon, Eran Yahav. ICLR 2019

+

+ + [ArXiV] + + + +
+ + naming + + deobfuscation + + GNN + +

+

We address the problem of predicting procedure names in stripped executables which contain no debug information. +Predicting procedure names can dramatically ease the task of reverse engineering, saving precious time and human effort. +We present a novel approach that leverages static analysis of binaries with encoder-decoder-based neural networks. +The main idea is to use static analysis to obtain enriched representations of API call sites; encode a set of sequences +of these call sites; and finally, attend to the encoded sequences while decoding the target name token-by-token. +We evaluate our model by predicting procedure names over 60,000 procedures in 10,000 stripped executables. +Our model achieves 81.70 precision and 80.12 recall in predicting procedure names within GNU packages, and 55.48 +precision and 51.31 recall in a diverse, cross-package, dataset. Comparing to previous approaches, +the predictions made by our model are much more accurate and informative.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Path-Based Function Embedding and its Application to Specification Mining

+

Daniel DeFreez, Aditya V. Thakur, Cindy Rubio-González. ICSE 2018

+

+ + [ArXiV] + + + +
+ + program analysis + + representation + +

+

Identifying the relationships among program elements is useful +for program understanding, debugging, and analysis. One such +relationship is synonymy. Function synonyms are functions that +play a similar role in code, e.g. functions that perform initialization +for different device drivers, or functions that implement different +symmetric-key encryption schemes. Function synonyms are not +necessarily semantically equivalent and can be syntactically dissimilar; consequently, approaches for identifying code clones or +functional equivalence cannot be used to identify them. This paper presents func2vec, an algorithm that maps each function to a vector in a vector space such that function synonyms are grouped +together. We compute the function embedding by training a neu- +ral network on sentences generated from random walks over an +encoding of the program as a labeled pushdown system (ℓ-PDS). +We demonstrate that func2vec +is effective at identifying function +synonyms in the Linux kernel. Furthermore, we show how function +synonyms enable mining error-handling specifications with high +support in Linux file systems and drivers.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CoNCRA: A Convolutional Neural Network Code Retrieval Approach

+

Marcelo de Rezende Martins, Marco Aurélio Gerosa. SBES '20 2020

+

+ + [ArXiV] + + [code] + + + +
+ + search + +

+

Software developers routinely search for code using general-purpose search engines. However, these search engines cannot find code semantically unless it has an accompanying description. We propose a technique for semantic code search: A Convolutional Neural Network approach to code retrieval (CoNCRA). Our technique aims to find the code snippet that most closely matches the developer’s intent, expressed in natural language. We evaluated our approach’s efficacy on a dataset composed of questions and code snippets collected from Stack Overflow. Our preliminary results showed that our technique, which prioritizes local interactions (words nearby), improved the state-of-the-art (SOTA) by 5% on average, retrieving the most relevant code snippets in the top 3 (three) positions by almost 80% of the time. Therefore, our technique is promising and can improve the efficacy of semantic code retrieval.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Learning & Software Engineering: State of Research and Future Directions

+

Prem Devanbu, Matthew Dwyer, Sebastian Elbaum, Michael Lowry, Kevin Moran, Denys Poshyvanyk, Baishakhi Ray, Rishabh Singh, Xiangyu Zhang. 2020

+

+ + [ArXiV] + + + +
+ + survey + +

+

Given the current transformative potential of research that sits at the intersection of Deep Learning (DL) and Software Engineering (SE), an NSF-sponsored community workshop was conducted in co-location with the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE’19) in San Diego, California. The goal of this workshop was to outline high priority areas for cross-cutting research. While a multitude of exciting directions for future work were identified, this report provides a general summary of the research areas representing the areas of highest priority which were discussed at the workshop. The intent of this report is to serve as a potential roadmap to guide future work that sits at the intersection of SE & DL.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Semantic Code Repair using Neuro-Symbolic Transformation Networks

+

Jacob Devlin, Jonathan Uesato, Rishabh Singh, Pushmeet Kohli. 2017

+

+ + [ArXiV] + + + +
+ + repair + +

+

We study the problem of semantic code repair, which can be broadly defined as automatically fixing +non-syntactic bugs in source code. The majority of past work in semantic code repair assumed access +to unit tests against which candidate repairs could be validated. In contrast, the goal here is to +develop a strong statistical model to accurately predict both bug locations and exact fixes without +access to information about the intended correct behavior of the program. Achieving such a goal +requires a robust contextual repair model, which we train on a large corpus of real-world source +code that has been augmented with synthetically injected bugs. Our framework adopts a two-stage +approach where first a large set of repair candidates are generated by rule-based processors, and +then these candidates are scored by a statistical model using a novel neural network architecture +which we refer to as Share, Specialize, and Compete. Specifically, the architecture (1) generates +a shared encoding of the source code using an RNN over the abstract syntax tree, +(2) scores each candidate repair using specialized network modules, and (3) then normalizes these +scores together so they can compete against one another in comparable probability space. We evaluate +our model on a real-world test set gathered from GitHub containing four common categories of bugs. +Our model is able to predict the exact correct repair 41% of the time with a single guess, compared +to 13% accuracy for an attentional sequence-to-sequence model.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

MulCode: A Multi-task Learning Approach for Source Code Understanding

+

Deze Wang, Yue Yu, Shanshan Li, Wei Dong, Ji Wang, Liao Qing. SANER 2021

+

+ + [PDF] + + + +
+ + representation + +

+

Recent years have witnessed the significant rise of Deep Learning (DL) techniques applied to source code. Researchers exploit DL for a multitude of tasks and achieve impressive results. However, most tasks are explored separately, resulting in a lack of generalization of the solutions. In this work, we propose MulCode, a multi-task learning approach for source code understanding that learns unified representation space for tasks, with the pre-trained BERT model for the token sequence and the Tree-LSTM model for abstract syntax trees. Furthermore, we integrate two source code views into a hybrid representation via the attention mechanism and set learnable uncertainty parameters to adjust the tasks’ relationship. We train and evaluate MulCode in three downstream tasks: comment classification, author attribution, and duplicate function detection. In all tasks, MulCode outperforms the state-of-theart techniques. Moreover, experiments on three unseen tasks demonstrate the generalization ability of MulCode compared with state-of-the-art embedding methods.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding

+

Deze Wang, Zhouyang Jia, Shanshan Li, Yue Yu, Yun Xiong, Wei Dong, Xiangke Liao. ICSE 2022

+

+ + [ArXiV] + + [code] + + + +
+ + representation + + language model + +

+

With the great success of pre-trained models, the pretrain-then-finetune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to costly training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach to bridge pre-trained models and code-related tasks. We exploit semantic-preserving transformation to enrich downstream data diversity, and help pre-trained models learn semantic features that are invariant to these semantically equivalent transformations. Further, we introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models.

+ +

We apply our approach to a range of pre-trained models, and they significantly outperform the state-of-the-art models on tasks for source code understanding, such as algorithm classification, code clone detection, and code search. Our experiments even show that without heavy pre-training on code data, natural language pre-trained model RoBERTa fine-tuned with our lightweight approach could outperform or rival existing code pre-trained models fine-tuned on the above tasks, such as CodeBERT and GraphCodeBERT. This finding suggests that there is still much room for improvement in code pre-trained models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Hoppity: Learning Bug Detection and Repair

+

Elizabeth Dinella, Hanjun Dai, Ziyang Li, Mayur Naik, Le Song, Ke Wang. ICLR 2020

+

+ + [OpenReview] + + [Demo] + + + +
+ + edit + + repair + +

+

We present a learning-based approach to detect and fix a broad range of bugs in Javascript programs. We frame the problem in terms of learning a sequence of graph transformations: given a buggy program modeled by a graph structure, our model makes a sequence of predictions including the position of bug nodes and corresponding graph edits to produce a fix. Unlike previous works that use deep neural networks, our approach targets bugs that are more complex and semantic in nature (i.e.~bugs that require adding or deleting statements to fix). We have realized our approach in a tool called HOPPITY. By training on 338,877 Javascript code change commits on Github, HOPPITY correctly detects and fixes bugs in 9,612 out of 42,365 programs in an end-to-end fashion. Given the bug location and type of the fix, HOPPITY also outperforms the baseline approach by a wide margin.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepMerge: Learning to Merge Programs

+

Elizabeth Dinella, Todd Mytkowicz, Alexey Svyatkovskiy, Christian Bird, Mayur Naik, Shuvendu K. Lahiri. 2021

+

+ + [ArXiV] + + + +
+ + edit + + repair + +

+

Program merging is ubiquitous in modern software development. Although commonly used in most version control systems, text-based merge algorithms are prone to producing spurious merge conflicts: they report a conflict even when program changes do not interfere with each other semantically. Spurious merge conflicts are costly to development as the need for manual intervention stalls modern continuous integration pipelines. We propose a novel data-driven approach to identify and resolve spurious merge conflicts with a sequence-to-sequence machine learning model. We realize our approach in a tool DeepMerge that uses a novel combination of (i) an edit-aware embedding of merge inputs and (ii) a variation of pointer networks to construct resolutions from input segments. We also propose an algorithm to extract ground truth manual resolutions from a code corpus and employ it to curate a dataset comprising 10,729 non-trivial resolutions in Javascript programs. Our evaluation shows that DeepMerge can predict correct resolutions with high precision (72%) and modest recall (34%) on the dataset overall, and high recall (78%) on merges comprising of upto 3 lines that comprise 24% of the dataset.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

TOGA: A Neural Method for Test Oracle Generation

+

Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, Shuvendu K. Lahiri. ICSE 2022

+

+ + [Preprint] + + + +
+ + code generation + + Transformer + + test generation + +

+

Testing is widely recognized as an important stage of the software +development lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation. +In terms of documentation, unit tests express a unit’s intended +functionality, as conceived by the developer. A test oracle, typically expressed as an condition, documents the intended behavior +of a unit under a given test prefix. Synthesizing a functional test +oracle is a challenging problem, as it must capture the intended +functionality rather than the implemented functionality. +In this paper, we propose TOGA (a neural method for Test Oracle +GenerAtion), a unified transformer-based neural approach to infer +both exceptional and assertion test oracles based on the context of +the focal method. Our approach can handle units with ambiguous +or missing documentation, and even units with a missing implementation. We evaluate our approach on both oracle inference accuracy +and functional bug-finding. Our technique improves accuracy by +33% over existing oracle inference approaches, achieving 96% overall accuracy on a held out test dataset. Furthermore, we show that +when integrated with a automated test generation tool (EvoSuite), +our approach finds 57 real world bugs in large-scale Java programs, +including 30 bugs that are not found by any other automated testing +method in our evaluation

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization

+

Steven H. H. Ding, Benjamin C. M. Fung, Philippe Charland. IEEE Symposium on Security and Privacy 2019

+

+ + + +
+ + representation + + clone + +

+

Reverse engineering is a manually intensive but necessary technique for understanding the inner workings of new malware, finding vulnerabilities in existing systems, and detecting patent infringements in released software. An assembly clone search engine facilitates the work of reverse engineers by identifying those duplicated or known parts. However, it is challenging to design a robust clone search engine, since there exist various compiler optimization options and code obfuscation techniques that make logically similar assembly functions appear to be very different. A practical clone search engine relies on a robust vector representation of assembly code. However, the existing clone search approaches, which rely on a manual feature engineering process to form a feature vector for an assembly function, fail to consider the relationships between features and identify those unique patterns that can statistically distinguish assembly functions. To address this problem, we propose to jointly learn the lexical semantic relationships and the vector representation of assembly functions based on assembly code. We have developed an assembly code representation learning model \emph{Asm2Vec}. It only needs assembly code as input and does not require any prior knowledge such as the correct mapping between assembly functions. It can find and incorporate rich semantic relationships among tokens appearing in assembly code. We conduct extensive experiments and benchmark the learning model with state-of-the-art static and dynamic clone search approaches. We show that the learned representation is more robust and significantly outperforms existing methods against changes introduced by obfuscation and optimizations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Contrastive Learning for Source Code with Structural and Functional Properties

+

Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty. 2021

+

+ + [ArXiV] + + + +
+ + representation + + pretraining + + Transformer + +

+

Pre-trained transformer models have recently shown promises for understanding the source code. Most existing works expect to understand code from the textual features and limited structural knowledge of code. However, the program functionalities sometimes cannot be fully revealed by the code sequence, even with structure information. Programs can contain very different tokens and structures while sharing the same functionality, but changing only one or a few code tokens can introduce unexpected or malicious program behaviors while preserving the syntax and most tokens. In this work, we present BOOST, a novel self-supervised model to focus pre-training based on the characteristics of source code. We first employ automated, structure-guided code transformation algorithms that generate (i.) functionally equivalent code that looks drastically different from the original one, and (ii.) textually and syntactically very similar code that is functionally distinct from the original. We train our model in a way that brings the functionally equivalent code closer and distinct code further through a contrastive learning objective. To encode the structure information, we introduce a new node-type masked language model objective that helps the model learn about structural context. We pre-train BOOST with a much smaller dataset than the state-of-the-art models, but our small models can still match or outperform these large models in code understanding and generation tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Static Evaluation of Code Completion by Large Language Models

+

Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, Dan Roth, Bing Xiang. 2023

+

+ + [ArXiV] + + + +
+ + LLM + + static analysis + +

+

Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven’t been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Piloting Copilot and Codex: Hot Temperature, Cold Prompts, or Black Magic?

+

Jean-Baptiste Döderlein, Mathieu Acher, Djamel Eddine Khelladi, Benoit Combemale. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

Language models are promising solutions for tackling increasing complex problems. In software engineering, they recently attracted attention in code assistants, with programs automatically written in a given programming language from a programming task description in natural language. They have the potential to save time and effort when writing code. However, these systems are currently poorly understood, preventing them from being used optimally. In this paper, we investigate the various input parameters of two language models, and conduct a study to understand if variations of these input parameters (e.g. programming task description and the surrounding context, creativity of the language model, number of generated solutions) can have a significant impact on the quality of the generated programs. We design specific operators for varying input parameters and apply them over two code assistants (Copilot and Codex) and two benchmarks representing algorithmic problems (HumanEval and LeetCode). Our results showed that varying the input parameters can significantly improve the performance of language models. However, there is a tight dependency when varying the temperature, the prompt and the number of generated solutions, making potentially hard for developers to properly control the parameters to obtain an optimal result. This work opens opportunities to propose (automated) strategies for improving performance.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeScore: Evaluating Code Generation by Learning Code Execution

+

Yihong Dong, Jiazheng Ding, Xue Jiang, Zhuo Li, Ge Li, Zhi Jin. 2023

+

+ + [ArXiV] + + + +
+ + Transformer + + evaluation + +

+

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing CEMs can be categorized into match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) and execution-based CEMs (e.g., AvgPassRatio and Pass@k), but both of them suffer from some issues. The former only measures differences in surface form regardless of the functional equivalence of codes, while the latter has huge execution overheads, including collecting expensive test cases, resolving tedious execution dependencies, and enormous execution time. To address these issues, in this paper, we propose CodeScore, an efficient and effective CEM for code generation, which estimates test case PassRatio of generated code without executing code. We also present a framework named UniCE for training unified code evaluation models by learning code execution, i.e., learning PassRatio and Executability of generated code. In order to learn code execution comprehensively, we construct more than 100 test cases for each task in several popular benchmark datasets, covering MBPP, APPS, and HumanEval. Experimental results show that CodeScore has obtained a state-of-the-art correlation with execution-based CEMs. CodeScore is strongly correlated with AvgPassPatio, and binary CodeScore is moderately correlated with Pass@1. In particular, CodeScore eliminates the need for test cases and execution dependencies in inference, and CodeScore reduces execution time by three orders of magnitude compared to AvgPassPatio and Pass@1.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons

+

Dawn Drain, Colin B. Clement, Guillermo Serrato, Neel Sundaresan. 2021

+

+ + [ArXiV] + + + +
+ + repair + + Transformer + +

+

The joint task of bug localization and program repair is an integral part of the software development process. In this work we present DeepDebug, an approach to automated debugging using large, pretrained transformers. We begin by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs. We apply these synthetic bugs toward two ends. First, we directly train a backtranslation model on all functions from 200K repositories. Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions in those repositories that are covered by passing tests. This provides us with rich debugging information such as stack traces and print statements, which we use to finetune our model which was pretrained on raw source code. Finally, we strengthen all our models by expanding the context window beyond the buggy function itself, and adding a skeleton consisting of that function’s parent class, imports, signatures, docstrings, and method bodies, in order of priority. On the QuixBugs benchmark, we increase the total number of fixes found by over 50%, while also decreasing the false positive rate from 35% to 5% and decreasing the timeout from six hours to one minute. On our own benchmark of executable tests, our model fixes 68% of all bugs on its first attempt without using traces, and after adding traces it fixes 75% on first attempt. We will open-source our framework and validation set for evaluating on executable tests.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Generating Bug-Fixes Using Pretrained Transformers

+

Dawn Drain, Chen Wu, Alexey Svyatkovskiy, Neel Sundaresan. 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + repair + +

+

Detecting and fixing bugs are two of the most important yet frustrating parts of the software development cycle. Existing bug detection tools are based mainly on static analyzers, which rely on mathematical logic and symbolic reasoning about the program execution to detect common types of bugs. Fixing bugs is typically left out to the developer. In this work we introduce DeepDebug: a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories. We frame bug-patching as a sequence-to-sequence learning task consisting of two steps: (i) denoising pretraining, and (ii) supervised finetuning on the target translation task. We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch, while domain-adaptive pretraining from natural language to code further improves the accuracy by another 32%. We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art. In contrast to prior work, we attain our best results when generating raw code, as opposed to working with abstracted code that tends to only benefit smaller capacity models. Finally, we observe a subtle improvement from adding syntax embeddings along with the standard positional embeddings, as well as with adding an auxiliary task to predict each token’s syntactic class. Despite focusing on Java, our approach is language agnostic, requiring only a general-purpose parser such as tree-sitter.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural-Network Guided Expression Transformation

+

Romain Edelmann, Viktor Kunčak. 2019

+

+ + [ArXiV] + + + +
+ + optimization + + grammar + +

+

Optimizing compilers, as well as other translator systems, often work by rewriting expressions according to equivalence preserving rules. Given an input expression and its optimized form, finding the sequence of rules that were applied is a non-trivial task. Most of the time, the tools provide no proof, of any kind, of the equivalence between the original expression and its optimized form. In this work, we propose to reconstruct proofs of equivalence of simple mathematical expressions, after the fact, by finding paths of equivalence preserving transformations between expressions. We propose to find those sequences of transformations using a search algorithm, guided by a neural network heuristic. Using a Tree-LSTM recursive neural network, we learn a distributed representation of expressions where the Manhattan distance between vectors approximately corresponds to the rewrite distance between expressions. We then show how the neural network can be efficiently used to search for transformation paths, leading to substantial gain in speed compared to an uninformed exhaustive search. In one of our experiments, our neural-network guided search algorithm is able to solve more instances with a 2 seconds timeout per instance than breadth-first search does with a 5 minutes timeout per instance.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Unsupervised Learning of API Aliasing Specifications

+

Jan Eberhardt, Samuel Steffen, Veselin Raychev, Martin Vechev. PLDI 2019

+

+ + + +
+ + API + + program analysis + +

+

Real world applications make heavy use of powerful libraries +and frameworks, posing a significant challenge for static analysis +as the library implementation may be very complex or unavailable. +Thus, obtaining specifications that summarize the behaviors of +the library is important as it enables static analyzers to precisely +track the effects of APIs on the client program, without requiring +the actual API implementation.

+ +

In this work, we propose a novel method +for discovering aliasing specifications of APIs by learning from a large +dataset of programs. Unlike prior work, our method does not require +manual annotation, access to the library’s source code or ability to +run its APIs. Instead, it learns specifications in a fully unsupervised manner, +by statically observing usages of APIs in the dataset. The core idea is to +learn a probabilistic model of interactions between API methods and aliasing +objects, enabling identification of additional likely aliasing relations, +and to then infer aliasing specifications ofAPIs that explain these relations. +The learned specifications are then used to augment an API-aware points-to analysis.

+ +

We implemented our approach in a tool called USpec and used it to automatically +learn aliasing specifications from millions of source code files. +USpec learned over 2000 specifications of various Java and Python APIs, in the process +improving the results of the points-to analysis and its clients.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Semantic Source Code Models Using Identifier Embeddings

+

Vasiliki Efstathiou, Diomidis Spinellis. MSR 2019

+

+ + + +
+ + representation + +

+

The emergence of online open source repositories in the recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As an effect, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code which can assist tasks of code search and reuse. To this end, we deliver in the form of pretrained vector space models, distributed code representations for six popular programming languages, namely, Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state-of-the-art library for learning word representations. Each model is trained on data from a single programming language; the code mined for producing all models amounts to over 13.000 repositories. We indicate dissimilarities between natural language and source code, as well as variations in coding conventions in between the different programming languages we processed. We describe how these heterogeneities guided the data preprocessing decisions we took and the selection of the training parameters in the released models. Finally, we propose potential applications of the models and discuss limitations of the models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code

+

Aryaz Eghbali, Michael Pradel. ASE 2022

+

+ + [Preprint] + + + +
+ + evaluation + +

+

Recent years have brought a surge of work on predicting pieces +of source code, e.g., for code completion, code migration, program +repair, or translating natural language into code. All this work faces +the challenge of evaluating the quality of a prediction w.r.t. some +oracle, typically in the form of a reference solution. A common +evaluation metric is the BLEU score, an n-gram-based metric originally proposed for evaluating natural language translation, but +adopted in software engineering because it can be easily computed +on any programming language and enables automated evaluation at +scale. However, a key difference between natural and programming +languages is that in the latter, completely unrelated pieces of code +may have many common n-grams simply because of the syntactic +verbosity and coding conventions of programming languages. We +observe that these trivially shared n-grams hamper the ability of +the metric to distinguish between truly similar code examples and +code examples that are merely written in the same language. This +paper presents CrystalBLEU, an evaluation metric based on BLEU, +that allows for precisely and efficiently measuring the similarity of +code. Our metric preserves the desirable properties of BLEU, such +as being language-agnostic, able to handle incomplete or partially +incorrect code, and efficient, while reducing the noise caused by +trivially shared n-grams. We evaluate CrystalBLEU on two datasets +from prior work and on a new, labeled dataset of semantically equivalent programs. Our results show that CrystalBLEU can distinguish +similar from dissimilar code examples 1.9–4.5 times more effectively, when compared to the original BLEU score and a previously +proposed variant of BLEU for code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DreamCoder: bootstrapping inductive program synthesis with wake-sleep library learning

+

Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lucas Morales, Luke Hewitt, Armando Solar-Lezama, Joshua B. Tenenbaum. 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI 2021) 2021

+

+ + [ArXiV] + + [Paper] + + [Code] + + + +
+ + synthesis + + search + +

+

We present a system for inductive program synthesis called DreamCoder, which inputs a corpus of synthesis problems each specified by one or a few examples, and automatically derives a library of program components and a neural search policy that can be used to efficiently solve other similar synthesis problems. The library and search policy bootstrap each other iteratively through a variant of “wake-sleep” approximate Bayesian learning. A new refactoring algorithm based on E-graph matching identifies common sub-components across synthesized programs, building a progressively deepening library of abstractions capturing the structure of the input domain. We evaluate on eight domains including classic program synthesis areas and AI tasks such as planning, inverse graphics, and equation discovery. We show that jointly learning the library and neural search policy leads to solving more problems, and solving them more quickly.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing

+

Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, Burkhard Rost. 2021

+

+ + [ArXiV] + + [Code] + + [Models] + + + +
+ + Transformer + +

+

Currently, a growing number of mature natural language processing applications make people’s life more convenient. Such applications are built by source code - the language in software engineering. However, the applications for understanding source code language to ease the software engineering process are under-researched. Simultaneously, the transformer model, especially its combination with transfer learning, has been proven to be a powerful technique for natural language processing tasks. These breakthroughs point out a promising direction for process source code and crack software engineering tasks. This paper describes CodeTrans - an encoder-decoder transformer model for tasks in the software engineering domain, that explores the effectiveness of encoder-decoder transformer models for six software engineering tasks, including thirteen sub-tasks. Moreover, we have investigated the effect of different training strategies, including single-task learning, transfer learning, multi-task learning, and multi-task learning with fine-tuning. CodeTrans outperforms the state-of-the-art models on all the tasks. To expedite future works in the software engineering domain, we have published our pre-trained models of CodeTrans.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Automatically Testing Functional Properties of Code Translation Models

+

Hasan Ferit Eniser, Valentin Wüstholz, Maria Christakis. AAAI 2023

+

+ + [ArXiV] + + + +
+ + translation + +

+

Large language models are becoming increasingly practical for translating code across programming languages, a process known as $transpiling$. Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

+

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou. 2020

+

+ + [ArXiV] + + + +
+ + pretraining + +

+

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and nat-ural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language codesearch, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Structured Neural Summarization

+

Patrick Fernandes, Miltiadis Allamanis, Marc Brockschmidt. ICLR 2019

+

+ + [OpenReview] + + [ArXiV] + + [OpenGNN] + + [Code] + + + +
+ + summarization + + GNN + + documentation + +

+

Summarization of long sequences into a concise statement is a core problem in natural language processing, requiring non-trivial understanding of the input. Based on the promising results of graph neural networks on highly structured data, we develop a framework to extend existing sequence encoders with a graph component that can reason about long-distance relationships in weakly structured data such as text. In an extensive evaluation, we show that the resulting hybrid sequence-graph models outperform both pure sequence models as well as pure graph models on a range of summarization tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Parameter-Free Probabilistic API Mining across GitHub

+

Jaroslav Fowkes, Charles Sutton. FSE 2016

+

+ + + +
+ + API + + pattern mining + +

+

Existing API mining algorithms can be difficult to use as they require expensive parameter tuning and the returned set of API calls can be large, highly redundant and difficult to understand. To address this, we present PAM (Probabilistic API Miner), a near parameter-free probabilistic algorithm for mining the most interesting API call patterns. We show that PAM significantly outperforms both MAPO and UPMiner, achieving 69% test-set precision, at retrieving relevant API call sequences from GitHub. Moreover, we focus on libraries for which the developers have explicitly provided code examples, yielding over 300,000 LOC of hand-written API example code from the 967 client projects in the data set. This evaluation suggests that the hand-written examples actually have limited coverage of real API usages.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Autofolding for Source Code Summarization

+

Jaroslav Fowkes, Razan Ranca, Miltiadis Allamanis, Mirella Lapata, Charles Sutton. TSE 2017

+

+ + + +
+ + summarization + +

+

Developers spend much of their time reading and browsing source code, raising new opportunities for summarization methods. Indeed, modern code editors provide code folding, which allows one to selectively hide blocks of code. However this is impractical to use as folding decisions must be made manually or based on simple rules. We introduce the +autofolding problem, which is to automatically create a code summary by folding less informative code regions. We present a novel solution by formulating the problem as a sequence of AST folding decisions, leveraging a scoped topic model for code tokens. On an annotated set of popular open source projects, we show that our summarizer outperforms simpler baselines, yielding a 28% error reduction. Furthermore, we find through a case study that our summarizer is strongly preferred by experienced developers. More broadly, we hope this work will aid program comprehension by turning code folding into a usable and valuable tool.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CACHECA: A Cache Language Model Based Code Suggestion Tool

+

Christine Franks, Zhaopeng Tu, Premkumar Devanbu, Vincent Hellendoorn. ICSE 2015

+

+ + + +
+ + language model + +

+

Nearly every Integrated Development Environment includes a form of code completion. The suggested completions (“suggestions”) are typically based on information available at compile time, such as type signatures and variables in scope. A statistical approach, based on estimated models of code patterns in large code corpora, has been demonstrated to be effective at predicting tokens given a context. In this demo, we present CACHECA, an Eclipse plugin that combines the native suggestions with a statistical suggestion regime. We demonstrate that a combination of the two approaches more than doubles Eclipse’s suggestion accuracy. A video demonstration is available at https://www.youtube.com/watch?v=3INk0N3JNtc.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

InCoder: A Generative Model for Code Infilling and Synthesis

+

Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, Mike Lewis. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + code generation + + naming + + summarization + +

+

Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling). InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first generative model that is able to directly perform zero-shot code infilling, which we evaluate on challenging tasks such as type inference, comment generation, and variable re-naming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale. The InCoder models and code are publicly released at https://sites.google.com/view/incoder-code-models

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Coda: An End-to-End Neural Program Decompiler

+

Cheng Fu, Huili Chen, Haolan Liu, Xinyun Chen, Yuandong Tian, Farinaz Koushanfar, Jishen Zhao. NeurIPS 2019

+

+ + [Proceedings] + + + +
+ + decompilation + +

+

Reverse engineering of binary executables is a critical problem in the computer security domain. On the one hand, malicious parties may recover interpretable source codes from the software products to gain commercial advantages. On the other hand, binary decompilation can be leveraged for code vulnerability analysis and malware detection. However, efficient binary decompilation is challenging. Conventional decompilers have the following major limitations: (i) they are only applicable to specific source-target language pair, hence incurs undesired development cost for new language tasks; (ii) their output high-level code cannot effectively preserve the correct functionality of the input binary; (iii) their output program does not capture the semantics of the input and the reversed program is hard to interpret. To address the above problems, we propose Coda1, the first end-to-end neural-based framework for code decompilation. Coda decomposes the decompilation task into of two key phases: First, Coda employs an instruction type-aware encoder and a tree decoder for generating an abstract syntax tree (AST) with attention feeding during the code sketch generation stage. Second, Coda then updates the code sketch using an iterative error correction machine guided by an ensembled neural error predictor. By finding a good approximate candidate and then fixing it towards perfect, Coda achieves superior with performance compared to baseline approaches. We assess Coda’s performance with extensive experiments on various benchmarks. Evaluation results show that Coda achieves an average of 82% program recovery accuracy on unseen binary samples, where the state-of-the-art decompilers yield 0% accuracy. Furthermore, Coda outperforms the sequence-to-sequence model with attention by a margin of 70% program accuracy. Our work reveals the vulnerability of binary executables and imposes a new threat to the protection of Intellectual Property (IP) for software development.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Neural Model for Method Name Generation from Functional Description

+

Sa Gao, Chunyang Chen, Zhenchang Xing, Yukun Ma, Wen Song, Shang-Wei Lin. SANER 2019

+

+ + + +
+ + naming + + summarization + +

+

The names of software artifacts, e.g., method names, are important for software understanding and maintenance, as good names can help developers easily understand others’ code. However, the existing naming guidelines are difficult for developers, especially novices, to come up with meaningful, concise and compact names for the variables, methods, classes and files. With the popularity of open source, an enormous amount of project source code can be accessed, and the exhaustiveness and instability of manually naming methods could now be relieved by automatically learning a naming model from a large code repository. Nevertheless, building a comprehensive naming system is still challenging, due to the gap between natural language functional descriptions and method names. Specifically, there are three challenges: how to model the relationship between the functional descriptions and formal method names, how to handle the explosion of vocabulary when dealing with large repositories, and how to leverage the knowledge learned from large repositories to a specific project. To answer these questions, we propose a neural network to directly generate readable method names from natural language description. The proposed method is built upon the encoder-decoder framework with the attention and copying mechanisms. Our experiments show that our method can generate meaningful and accurate method names and achieve significant improvement over the state-of-the-art baseline models. We also address the cold-start problem using a training trick to utilize big data in GitHub for specific projects.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepPERF: A Deep Learning-Based Approach For Improving Software Performance

+

Spandan Garg, Roshanak Zilouchian Moghaddam, Colin B. Clement, Neel Sundaresan, Chen Wu. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + optimization + +

+

Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the wide-spread availability of open source data creates a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepPERF, a transformer-based approach to suggest performance improvements for C# applications. We pretrain DeepPERF on English and Source code corpora and followed by finetuning for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in ~53% of the cases, getting ~34% of them verbatim in our expert-verified dataset of performance changes made by C# developers. Additionally, we evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that our model is able to suggest valid performance improvements that can improve both CPU usage and Memory allocations. So far we’ve submitted 19 pull-requests with 28 different performance optimizations and 11 of these PRs have been approved by the project owners.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

T5APR: Empowering Automated Program Repair across Languages through Checkpoint Ensemble

+

Reza Gharibi, Mohammad Hadi Sadreddini, Seyed Mostafa Fakhrahmad. 2024

+

+ + [ArXiV] + + [Code] + + + +
+ + repair + + Transformer + +

+

Automated program repair (APR) using deep learning techniques has become an important area of research in recent years, aiming to automatically generate bug-fixing patches that can improve software reliability and maintainability. However, most existing methods either target a single language or require high computational resources to train multilingual models. In this paper, we propose T5APR, a novel neural program repair approach that provides a unified solution for bug fixing across multiple programming languages. T5APR leverages CodeT5, a powerful pre-trained text-to-text transformer model, and adopts a checkpoint ensemble strategy to improve patch recommendation. We conduct comprehensive evaluations on six well-known benchmarks in four programming languages (Java, Python, C, JavaScript), demonstrating T5APR’s competitiveness against state-of-the-art techniques. T5APR correctly fixes 1,985 bugs, including 1,442 bugs that none of the compared techniques has fixed. We further support the effectiveness of our approach by conducting detailed analyses, such as comparing the correct patch ranking among different techniques. The findings of this study demonstrate the potential of T5APR for use in real-world applications and highlight the importance of multilingual approaches in the field of APR.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On the Naturalness and Localness of Software Logs

+

Sina Gholamian, Paul A. S. Ward. 2021

+

+ + + +
+ + logging + + language model + +

+

Logs are an essential part of the development and +maintenance of large and complex software systems as they +contain rich information pertaining to the dynamic content and +state of the system. As such, developers and practitioners rely +heavily on the logs to monitor their systems. In parallel, the +increasing volume and scale of the logs, due to the growing +complexity of modern software systems, renders the traditional +way of manual log inspection insurmountable. Consequently, to +handle large volumes of logs efficiently and effectively, various +prior research aims to automate the analysis of log files. Thus, in +this paper, we begin with the hypothesis that log files are natural +and local and these attributes can be applied for automating log +analysis tasks. We guide our research with six research questions +with regards to the naturalness and localness of the log files, and +present a case study on anomaly detection and introduce a tool +for anomaly detection, called ANALOG, to demonstrate how our +new findings facilitate the automated analysis of logs.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

OverCode: visualizing variation in student solutions to programming problems at scale

+

Elena L. Glassman, Jeremy Scott, Rishabh Singh, Philip J. Guo, Robert C. Miller. TOCHI 2015

+

+ + + +
+ + repair + +

+

In MOOCs, a single programming exercise may produce thousands of solutions from learners. Understanding solution variation is important for providing appropriate feedback to students at scale. The wide variation among these solutions can be a source of pedagogically valuable examples and can be used to refine the autograder for the exercise by exposing corner cases. We present OverCode, a system for visualizing and exploring thousands of programming solutions. OverCode uses both static and dynamic analysis to cluster similar solutions, and lets teachers further filter and cluster solutions based on different criteria. We evaluated OverCode against a nonclustering baseline in a within-subjects study with 24 teaching assistants and found that the OverCode interface allows teachers to more quickly develop a high-level view of students’ understanding and misconceptions, and to provide feedback that is relevant to more students’ solutions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A case study on machine learning for synthesizing benchmarks

+

Andrés Goens, Alexander Brauckmann, Sebastian Ertel, Chris Cummins, Hugh Leather, Jeronimo Castrillon. MAPL 2019

+

+ + + +
+ + code generation + +

+

Good benchmarks are hard to find because they require a substantial effort to keep them representative for the constantly changing challenges of a particular field. Synthetic benchmarks are a common approach to deal with this, and methods from machine learning are natural candidates for synthetic benchmark generation. In this paper we investigate the usefulness of machine learning in the prominent CLgen benchmark generator. We re-evaluate CLgen by comparing the benchmarks generated by the model with the raw data used to train it. This re-evaluation indicates that, for the use case considered, machine learning did not yield additional benefit over a simpler method using the raw data. We investigate the reasons for this and provide further insights into the challenges the problem could pose for potential future generators.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code to Comment "Translation": Data, Metrics, Baselining & Evaluation

+

David Gros, Hariharan Sezhiyan, Premkumar Devanbu, Zhou Yu. 2020

+

+ + [ArXiV] + + + +
+ + bimodal + + documentation + +

+

The relationship of comments to code, and in particular, the task of generating useful comments given the code, has long been of interest. The earliest approaches have been based on strong syntactic theories of comment-structures, and relied on textual templates. More recently, researchers have applied deep learning methods to this task, and specifically, trainable generative translation models which are known to work very well for Natural Language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, and so similar models and evaluation metrics could be used. We analyze several recent code-comment datasets for this task: CodeNN, DeepCom, FunCom, and DocString. We compare them with WMT19, a standard dataset frequently used to train state of the art natural language translators. We found some interesting differences between the code-comment data and the WMT19 natural language data. Next, we describe and conduct some studies to calibrate BLEU (which is commonly used as a measure of comment quality). using “affinity pairs” of methods, from different projects, in the same project, in the same class, etc; Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep API Learning

+

Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim.. FSE 2016

+

+ + + +
+ + API + + search + +

+

Developers often wonder how to implement a certain functionality (e.g., how to parse XML files) using APIs. Obtaining an API usage sequence based on an API-related natural language query is very helpful in this regard. Given a query, existing approaches utilize information retrieval models to search for matching API sequences. These approaches treat queries and APIs as bag-of-words (i.e., keyword matching or word-to-word alignment) and lack a deep understanding of the semantics of the query.

+ +

We propose DeepAPI, a deep learning based approach to generate API usage sequences for a given natural language query. Instead of a bags-of-words assumption, it learns the +sequence of words in a query and the sequence of associated APIs. DeepAPI adapts a neural language model named RNN Encoder-Decoder. It encodes a word sequence (user query) into a fixed-length context vector, and generates an API sequence based on the context vector. We also augment the RNN Encoder-Decoder by considering the importance of individual APIs. We empirically evaluate our approach with more than 7 million annotated code snippets collected from GitHub. The results show that our approach generates largely accurate API sequences and outperforms the related approaches.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning

+

Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, Sunghun Kim. IJCAI 2017

+

+ + [ArXiV] + + + +
+ + API + +

+

Computer programs written in one language are often required to be ported to other languages to support multiple devices and environments. When programs use language specific APIs (Application Programming Interfaces), it is very challenging to migrate these APIs to the corresponding APIs written in other languages. Existing approaches mine API mappings from projects that have corresponding versions in two languages. They rely on the sparse availability of bilingual projects, thus producing a limited number of API mappings. In this paper, we propose an intelligent system called DeepAM for automatically mining API mappings from a large-scale code corpus without bilingual projects. The key component of DeepAM is based on the multimodal sequence to sequence learning architecture that aims to learn joint semantic representations of bilingual API sequences from big source code data. Experimental results indicate that DeepAM significantly increases the accuracy of API mappings as well as the number of API mappings, when compared with the state-of-the-art approaches.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Code Search

+

Xiaodong Gu, Hongyu Zhang, Sunghun Kim.. ICSE 2018

+

+ + + +
+ + search + +

+

To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code.

+ +

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled.

+ +

As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Cross-Language Binary-Source Code Matching with Intermediate Representations

+

Yi Gui, Yao Wan, Hongyu Zhang, Huifang Huang, Yulei Sui, Guandong Xu, Zhiyuan Shao, Hai Jin. SANER 2022

+

+ + [ArXiV] + + [Code] + + + +
+ + code similarity + + clone + +

+

Binary-source code matching plays an important role in many security and software engineering related tasks such as malware detection, reverse engineering and vulnerability assessment. Currently, several approaches have been proposed for binary-source code matching by jointly learning the embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches target on matching the binary code and source code written in a single programming language. However, in practice, software applications are often written in different programming languages to cater for different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching, and develops a new dataset for this new problem. We present a novel approach XLIR, which is a Transformer-based neural network by learning the intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks of cross-language binary-source code matching, and cross-language source-source code matching, on top of our curated dataset. Experimental results and analysis show that our proposed XLIR with intermediate representations significantly outperforms other state-of-the-art models in both of the two tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

NLyze: Interactive Programming by Natural Language for SpreadSheet Data Analysis and Manipulation

+

Sumit Gulwani, Mark Marron. SIGMOD 2014

+

+ + + +
+ + code generation + + bimodal + + synthesis + +

+

Millions of computer end users need to perform tasks over tabular spreadsheet data, yet lack the programming knowledge to do such tasks automatically. This paper describes +the design and implementation of a robust natural language +based interface to spreadsheet programming. Our methodology involves designing a typed domain-specific language +(DSL) that supports an expressive algebra of map, filter, reduce, join, and formatting capabilities at a level of abstraction appropriate for non-expert users. The key algorithmic +component of our methodology is a translation algorithm +for converting a natural language specification in the context of a given spreadsheet to a ranked set of likely programs +in the DSL. The translation algorithm leverages the spreadsheet spatial and temporal context to assign interpretations +to specifications with implicit references, and is thus robust +to a variety of ways in which end users can express the same +task. The translation algorithm builds over ideas from keyword programming and semantic parsing to achieve both +high precision and high recall. We implemented the system +as an Excel add-in called NLyze that supports a rich user +interaction model including annotating the user’s natural +language specification and explaining the synthesized DSL +programs by paraphrasing them into structured English. We +collected a total of 3570 English descriptions for 40 spreadsheet tasks and our system was able to generate the intended +interpretation as the top candidate for 94% (97% for the top +3) of those instances.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Semantically enhanced software traceability using deep learning techniques

+

Jin Guo, Jinghui Cheng, Jane Cleland-Huang. ICSE 2017

+

+ + + +
+ + traceability + + representation + +

+

In most safety-critical domains the need for traceability is prescribed by certifying bodies. Trace links are generally created among requirements, design, source code, test cases and other artifacts; however, creating such links manually is time consuming and error prone. Automated solutions use information retrieval and machine learning techniques to generate trace links; however, current techniques fail to understand semantics of the software artifacts or to integrate domain knowledge into the tracing process and therefore tend to deliver imprecise and inaccurate results. In this paper, we present a solution that uses deep learning to incorporate requirements artifact semantics and domain knowledge into the tracing solution. We propose a tracing network architecture that utilizes Word Embedding and Recurrent Neural Network (RNN) models to generate trace links. Word embedding learns word vectors that represent knowledge of the domain corpus and RNN uses these word vectors to learn the sentence semantics of requirements artifacts. We trained 360 different configurations of the tracing network using existing trace links in the Positive Train Control domain and identified the Bidirectional Gated Recurrent Unit (BI-GRU) as the best model for the tracing task. BI-GRU significantly out-performed state-of-the-art tracing methods including the Vector Space Model and Latent Semantic Indexing.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

GraphCodeBERT: Pre-training Code Representations with Data Flow

+

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Jian Yin, Daxin Jiang, Ming Zhou. 2020

+

+ + [ArXiV] + + + +
+ + pretraining + +

+

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of “where-the-value-comes-from” between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Complete Code with Sketches

+

Daya Guo, Alexey Svyatkovskiy, Jian Yin, Nan Duan, Marc Brockschmidt, Miltiadis Allamanis. ICLR 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + language model + + grammar + +

+

Code completion is usually cast as a language modelling problem, i.e., continuing an input in a left-to-right fashion. However, in practice, some parts of the completion (e.g., string literals) may be very hard to predict, whereas subsequent parts directly follow from the context. To handle this, we instead consider the scenario of generating code completions with “holes” inserted in places where a model is uncertain. We develop Grammformer, a Transformer-based model that guides code generation by the programming language grammar, and compare it to a variety of more standard sequence models.

+ +

We train the models on code completion for C# and Python given partial code context. To evaluate models, we consider both ROUGE as well as a new metric RegexAcc that measures success of generating completions matching long outputs with as few holes as possible. In our experiments, Grammformer generates 10-50% more accurate completions compared to traditional generative models and 37-50% longer sketches compared to sketch-generating baselines trained with similar techniques.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

UniXcoder: Unified Cross-Modal Pre-training for Code Representation

+

Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion that requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming language. The model utilizes mask attention matrices with prefix adapters to control the behavior of the model and leverages cross-modal contents like AST and code comment to enhance code representation. To encode AST that is represented as a tree in parallel, we propose a one-to-one mapping method to transform AST in a sequence structure that retains all structural information from the tree. Furthermore, we propose to utilize multi-modal contents to learn representation of code fragment with contrastive learning, and then align representations among programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the performance of code fragment representation, we also construct a dataset for a new task, called zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks and analysis reveals that comment and AST can both enhance UniXcoder.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

+

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang. 2024

+

+ + [ArXiV] + + + +
+ + Transformers + +

+

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepFix: Fixing Common C Language Errors by Deep Learning

+

Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade. AAAI 2017

+

+ + + +
+ + repair + + code generation + +

+

The problem of automatically fixing programming errors is a +very active research topic in software engineering. This is a +challenging problem as fixing even a single error may require +analysis of the entire program. In practice, a number of errors +arise due to programmer’s inexperience with the programming language or lack of attention to detail. We call these +common programming errors. These are analogous to grammatical errors in natural languages. Compilers detect such errors, but their error messages are usually inaccurate. In this +work, we present an end-to-end solution, called DeepFix, that +can fix multiple such errors in a program without relying on +any external tool to locate or fix them. At the heart of DeepFix +is a multi-layered sequence-to-sequence neural network with +attention which is trained to predict erroneous program locations along with the required correct statements. On a set of +6971 erroneous C programs written by students for 93 programming tasks, DeepFix could fix 1881 (27%) programs +completely and 1338 (19%) programs partially.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Reinforcement Learning for Programming Language Correction

+

Rahul Gupta, Aditya Kanade, Shirish Shevade. 2018

+

+ + [ArXiV] + + [Video] + + + +
+ + repair + + code generation + +

+

Novice programmers often struggle with the formal +syntax of programming languages. To assist them, +we design a novel programming language correction framework amenable to reinforcement learning. The framework allows an agent to mimic human actions for text navigation and editing. We +demonstrate that the agent can be trained through +self-exploration directly from the raw input, that is, +program text itself, without any knowledge of the +formal syntax of the programming language. We +leverage expert demonstrations for one tenth of the +training data to accelerate training. The proposed +technique is evaluated on 6975 +erroneous C programs with typographic errors, written by students +during an introductory programming course. Our +technique fixes 14% +more programs and 29% more +compiler error messages relative to those fixed by +a state-of-the-art tool, DeepFix, which uses a fully +supervised neural machine translation approach.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Intelligent code reviews using deep learning

+

Anshul Gupta, Neel Sundaresan. KDD 2018

+

+ + + +
+ + representation + + review + +

+

Peer code review is a best practice in Software Engineering where source code is reviewed manually by one or more peers(reviewers) of the code author. It is widely acceptable both in industry and open-source software (OSS) systems as a process for early detection and reduction of software defects. A larger chunk of reviews given during peer reviews are related to common issues such as coding style, documentations, and best practices. This makes the code review process less effective as reviewers focus less on finding important defects. Hence, there is a need to automatically find such common issues and help reviewers perform focused code reviews. Some of this is solved by rule based systems called linters but they are rigid and needs a lot of manual effort to adapt them for a new issue.

+ +

In this work, we present an automatic, flexible, and adaptive code analysis system called DeepCodeReviewer (DCR). DCR learns how to recommend code reviews related to common issues using historical peer reviews and deep learning. DCR uses deep learning to learn review relevance to a code snippet and recommend the right review from a repository of common reviews. DCR is trained on histroical peer reviews available from internal code repositories at Microsoft. Experiments demonstrate strong performance of developed deep learning model in classifying relevant and non-relevant reviews w.r.t to a code snippet, and ranking reviews given a code snippet. We have also evaluated DCR recommentations using a user study and survey. The results of our user study show good acceptance rate and answers of our survey questions are strongly correlated with our system’s goal of making code reviews focused on finding defects.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Attribution for Semantic Bug-Localization in Student Programs

+

Rahul Gupta, Aditya Kanade, Shirish Shevade. NeurIPS 2019

+

+ + + +
+ + defect + + representation + +

+

Providing feedback is an integral part of teaching. Most open online courses on programming make use of automated grading systems to support programming assignments and give real-time feedback. These systems usually rely on test results to quantify the programs’ functional correctness. They return failing tests to the students as feedback. However, students may find it difficult to debug their programs if they receive no hints about where the bug is and how to fix it. In this work, we present NeuralBugLocator, a deep learning based technique, that can localize the bugs in a faulty program with respect to a failing test, without even running the program. At the heart of our technique is a novel tree convolutional neural network which is trained to predict whether a program passes or fails a given test. To localize the bugs, we analyze the trained network using a state-of-the-art neural prediction attribution technique and see which lines of the programs make it predict the test outcomes. Our experiments show that NeuralBugLocator is generally more accurate than two state-of-the-art program-spectrum based and one syntactic difference based bug-localization baselines.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Grace: Language Models Meet Code Edits

+

Priyanshu Gupta, Avishree Khare, Yasharth Bajpai, Saikat Chakraborty, Sumit Gulwani, Aditya Kanade, Arjun Radhakrishna, Gustavo Soares, Ashish Tiwari. FSE 2023

+

+ + [ACM] + + + +
+ + editing + +

+

Developers spend a significant amount of time in editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) with the knowledge of relevant prior associated edits, which we call the Grace (Generation conditioned on Associated Code Edits) method. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, Grace boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Synthesizing Java expressions from free-form queries

+

Tihomir Gvero, Viktor Kuncak. OOPSLA 2015

+

+ + + +
+ + synthesis + + code generation + + bimodal + +

+

We present a new code assistance tool for integrated development environments. Our system accepts as input free-form queries containing a mixture of English and Java, and produces Java code expressions that take the query into account and respect syntax, types, and scoping rules of Java, as well as statistical usage patterns. In contrast to solutions based on code search, the results returned by our tool need not directly correspond to any previously seen code fragment. As part of our system we have constructed a probabilistic context free grammar for Java constructs and library invocations, as well as an algorithm that uses a customized natural language processing tool chain to extract information from free-form text queries. We present the results on a number of examples showing that our technique (1) often produces the expected code fragments, (2) tolerates much of the flexibility of natural language, and (3) can repair incorrect Java expressions that use, for example, the wrong syntax or missing arguments.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Bug Finding: A Study of Opportunities and Challenges

+

Andrew Habib, Michael Pradel. 2019

+

+ + [ArXiV] + + + +
+ + program analysis + +

+

Static analysis is one of the most widely adopted techniques to find software bugs before code is put in production. Designing and implementing effective and efficient static analyses is difficult and requires high expertise, which results in only a few experts able to write such analyses. This paper explores the opportunities and challenges of an alternative way of creating static bug detectors: neural bug finding. The basic idea is to formulate bug detection as a classification problem, and to address this problem with neural networks trained on examples of buggy and non-buggy code. We systematically study the effectiveness of this approach based on code examples labeled by a state-of-the-art, static bug detector. Our results show that neural bug finding is surprisingly effective for some bug patterns, sometimes reaching a precision and recall of over 80%, but also that it struggles to understand some program properties obvious to a traditional analysis. A qualitative analysis of the results provides insights into why neural bug finders sometimes work and sometimes do not work. We also identify pitfalls in selecting the code examples used to train and validate neural bug finders, and propose an algorithm for selecting effective training data.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SampleFix: Learning to Correct Programs by Sampling Diverse Fixes

+

Hossein Hajipour, Apratim Bhattacharyya, Cristian-Alexandru Staicu, Mario Fritz. 2019

+

+ + [ArXiV] + + + +
+ + repair + + code generation + +

+

Automatic program correction is an active topic of research, which holds the potential of dramatically improving productivity of programmers during the software development process and correctness of software in general. Recent advances in machine learning, deep learning and NLP have rekindled the hope to eventually fully automate the process of repairing programs. A key challenges is ambiguity, as multiple codes – or fixes – can implement the same functionality. In addition, dataset by nature fail to capture the variance introduced by such ambiguities. Therefore, we propose a deep generative model to automatically correct programming errors by learning a distribution of potential fixes. Our model is formulated as a deep conditional variational autoencoder that samples diverse fixes for the given erroneous programs. In order to account for ambiguity and inherent lack of representative datasets, we propose a novel regularizer to encourage the model to generate diverse fixes. Our evaluations on common programming errors show for the first time the generation of diverse fixes and strong improvements over the state-of-the-art approaches by fixing up to 61% of the mistakes.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Multi-Perspective Architecture for Semantic Code Search

+

Rajarshi Haldar, Lingfei Wu, Jinjun Xiong, Julia Hockenmaier. ACL 2020

+

+ + [ArXiV] + + + +
+ + search + +

+

The ability to match pieces of code to their corresponding natural language descriptions and vice versa is fundamental for natural language search interfaces to software repositories. In this paper, we propose a novel multi-perspective cross-lingual neural framework for code–text matching, inspired in part by a previous model for monolingual text-to-text matching, to capture both global and local similarities. Our experiments on the CoNaLa dataset show that our proposed model yields better performance on this cross-lingual text-to-code matching task than previous approaches that map code and text to a single joint embedding space.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Improved Automatic Summarization of Subroutines via Attention to File Context

+

Sakib Haque, Alexander LeClair, Lingfei Wu, Collin McMillan. 2020

+

+ + [ArXiV] + + + +
+ + summarization + +

+

Software documentation largely consists of short, natural language summaries of the subroutines in the software. These summaries help programmers quickly understand what a subroutine does without having to read the source code him or herself. The task of writing these descriptions is called “source code summarization” and has been a target of research for several years. Recently, AI-based approaches have superseded older, heuristic-based approaches. Yet, to date these AI-based approaches assume that all the content needed to predict summaries is inside subroutine itself. This assumption limits performance because many subroutines cannot be understood without surrounding context. In this paper, we present an approach that models the file context of subroutines (i.e. other subroutines in the same file) and uses an attention mechanism to find words and concepts to use in summaries. We show in an experiment that our approach extends and improves several recent baselines.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Semantic Similarity Metrics for Evaluating Source Code Summarization

+

Sakib Haque, Zachary Eberhart, Aakash Bansal, Collin McMillan. 2022

+

+ + [ArXiV] + + + +
+ + human evaluation + + evaluation + +

+

Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have to programmers and the simultaneously high cost of writing and maintaining documentation by hand. Current work is almost all based on machine models trained via big data input. Large datasets of examples of code and summaries of that code are used to train an e.g. encoder-decoder neural model. Then the output predictions of the model are evaluated against a set of reference summaries. The input is code not seen by the model, and the prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with using word overlap is that not all words in a sentence have the same importance, and many words have synonyms. The result is that calculated similarity may not match the perceived similarity by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate to human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for evaluation of source code summarization.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Repair Software Vulnerabilities with Generative Adversarial Networks

+

Jacob Harer, Onur Ozdemir, Tomo Lazovich, Christopher P. Reale, Rebecca L. Russell, Louis Y. Kim, Peter Chin. NeurIPS 2018

+

+ + [ArXiV] + + + +
+ + repair + + code generation + +

+

Motivated by the problem of automated repair of software vulnerabilities, we propose an adversarial learning approach that maps from one discrete source domain to another target domain without requiring paired labeled examples or source and target domains to be bijections. We demonstrate that the proposed adversarial learning approach is an effective technique for repairing software vulnerabilities, performing close to seq2seq approaches that require labeled pairs. The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Retrieve-and-Edit Framework for Predicting Structured Outputs

+

Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, Percy S. Liang. NeurIPS 2018

+

+ + + +
+ + bimodal + + search + + code generation + +

+

For the task of generating complex outputs such as source code, editing existing +outputs can be easier than generating complex outputs from scratch. With this +motivation, we propose an approach that first retrieves a training example based on +the input (e.g., natural language description) and then edits it to the desired output +(e.g., code). Our contribution is a computationally efficient method for learning +a retrieval model that embeds the input in a task-dependent way without relying +on a hand-crafted metric or incurring the expense of jointly training the retriever +with the editor. Our retrieve-and-edit framework can be applied on top of any +base model. We show that on a new autocomplete task for GitHub Python code +and the Hearthstone cards benchmark, retrieve-and-edit significantly boosts the +performance of a vanilla sequence-to-sequence model on both tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Generate Corrective Patches using Neural Machine Translation

+

Hideaki Hata, Emad Shihab, Graham Neubig. 2018

+

+ + [ArXiV] + + + +
+ + repair + + code generation + +

+

Bug fixing is generally a manually-intensive task. However, recent work has proposed the idea of automated program repair, which aims to repair (at least a subset of) bugs in different ways such as code mutation, etc. Following in the same line of work as automated bug repair, in this paper we aim to leverage past fixes to propose fixes of current/future bugs. Specifically, we propose Ratchet, a corrective patch generation system using neural machine translation. By learning corresponding pre-correction and post-correction code in past fixes with a neural sequence-to-sequence model, Ratchet is able to generate a fix code for a given bug-prone code query. We perform an empirical study with five open source projects, namely Ambari, Camel, Hadoop, Jetty and Wicket, to evaluate the effectiveness of Ratchet. Our findings show that Ratchet can generate syntactically valid statements 98.7% of the time, and achieve an F1-measure between 0.41-0.83 with respect to the actual fixes adopted in the code base. In addition, we perform a qualitative validation using 20 participants to see whether the generated statements can be helpful in correcting bugs. Our survey showed that Ratchet’s output was considered to be helpful in fixing the bugs on many occasions, even if fix was not 100% correct.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data

+

Moshe Hazoom, Vibhor Malik, Ben Bogin. NLP4Prog 2021

+

+ + [PDF] + + + +
+ + dataset + +

+

Most available semantic parsing datasets, comprising of pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluation of natural language understanding systems. As a result, they do not contain any of the richness and variety of natural-occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE, a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset, propose an evaluation metric based on comparison of partial query clauses that is more suitable for real-world queries, and conduct experiments with strong baselines, showing a large gap between the performance on SEDE compared to other common datasets.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Fuzz from Symbolic Execution with Application to Smart Contracts

+

Jingxuan He, Mislav Balunović, Nodar Ambroladze, Petar Tsankov, Martin Vechev. CCS 2019

+

+ + [Preprint] + + + +
+ + fuzzing + + GNN + +

+

Fuzzing and symbolic execution are two complementary techniques for discovering software vulnerabilities. Fuzzing is fast and scalable, but can be ineffective when it fails to randomly select the right inputs. Symbolic execution is thorough but slow and often does not scale to deep program paths with complex path conditions. In this work, we propose to learn an effective and fast fuzzer from symbolic execution, by phrasing the learning task in the framework of imitation learning. During learning, a symbolic execution expert generates a large number of quality inputs improving coverage on thousands of programs. Then, a fuzzing policy, represented with a suitable architecture of neural networks, is trained on the generated dataset. The learned policy can then be used to fuzz new programs. We instantiate our approach to the problem of fuzzing smart contracts, a domain where contracts often implement similar functionality (facilitating learning) and security is of utmost importance. We present an end-to-end system, ILF (for Imitation Learning based Fuzzer), and an extensive evaluation over >18K contracts. Our results show that ILF is effective: (i) it is fast, generating 148 transactions per second, (ii) it outperforms existing fuzzers (e.g., achieving 33% more coverage), and (iii) it detects more vulnerabilities than existing fuzzing and symbolic execution tools for Ethereum.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Find Naming Issues with Big Code and Small Supervision

+

Jingxuan He, Cheng-Chun Lee, Veselin Raychev, Martin Vechev. PLDI 2021

+

+ + + +
+ + repair + +

+

We introduce a new approach for finding and fixing naming +issues in source code. The method is based on a careful +combination of unsupervised and supervised procedures: (i) +unsupervised mining of patterns from Big Code that express +common naming idioms. Program fragments violating such +idioms indicates likely naming issues, and (ii) supervised +learning of a classifier on a small labeled dataset which filters +potential false positives from the violations.

+ +

We implemented our method in a system called +Namer and evaluated it on a large number of Python and Java programs. +We demonstrate that Namer is effective in finding naming mistakes +in real world repositories with high precision (∼70%). +Perhaps surprisingly, we also show that existing deep learning methods +are not practically effective and achieve low precision in finding naming issues (up to ∼16%).

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On Distribution Shift in Learning-based Bug Detectors

+

Jingxuan He, Luca Beurer-Kellner, Martin Vechev. 2022

+

+ + [ArXiV] + + [Dataset] + + + +
+ + defect + +

+

Deep learning has recently achieved initial success in program analysis tasks such as bug detection. Lacking real bugs, most existing works construct training and test data by injecting synthetic bugs into correct programs. Despite achieving high test accuracy (e.g. >90%), the resulting bug detectors are found to be surprisingly unusable in practice, i.e., <10% precision when used to scan real software repositories. In this work, we argue that this massive performance difference is caused by distribution shift, i.e., a fundamental mismatch between the real bug distribution and the synthetic bug distribution used to train and evaluate the detectors. To address this key challenge, we propose to train a bug detector in two phases, first on a synthetic bug distribution to adapt the model to the bug detection domain, and then on a real bug distribution to drive the model towards the real distribution. During these two phases, we leverage a multi-task hierarchy, focal loss, and contrastive learning to further boost performance. We evaluate our approach extensively on three widely studied bug types, for which we construct new datasets carefully designed to capture the real bug distribution. The results demonstrate that our approach is practically effective and successfully mitigates the distribution shift: our learned detectors are highly performant on both our constructed test set and the latest version of open source repositories.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Will they like this? Evaluating Code Contributions With Language Models

+

Vincent J. Hellendoorn, Premkumar Devanbu, Alberto Bacchelli. MSR 2015

+

+ + [Paper] + + + +
+ + review + + language model + +

+

Popular open-source software projects receive and +review contributions from a diverse array of developers, many +of whom have little to no prior involvement with the project. A +recent survey reported that reviewers consider conformance to +the project’s code style to be one of the top priorities when evaluating code contributions on Github. We propose to quantitatively +evaluate the existence and effects of this phenomenon. To this aim +we use language models, which were shown to accurately capture +stylistic aspects of code. We find that rejected changesets do +contain code significantly less similar to the project than accepted +ones; furthermore, the less similar changesets are more likely +to be subject to thorough review. Armed with these results we +further investigate whether new contributors learn to conform to +the project style and find that experience is positively correlated +with conformance to the project’s code style.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Are Deep Neural Networks the Best Choice for Modeling Source Code?

+

Vincent J. Hellendoorn, Premkumar Devanbu. FSE 2017

+

+ + [Paper] + + [Slides] + + [Code] + + + +
+ + language model + +

+

Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source +code. We argue here that the special properties of source code can +be exploited for further improvements. In this work, we enhance +established language modeling approaches to handle the special +challenges of modeling source code, such as: frequent changes, +larger, changing vocabularies, deeply nested scopes, etc. We present +a fast, nested language modeling toolkit specifically designed for +software, with the ability to add & remove text, and mix & swap out +many models. Specifically, we improve upon prior cache-modeling +work and present a model with a much more expansive, multi-level +notion of locality that we show to be well-suited for modeling +software. We present results on varying corpora in comparison +with traditional N -gram, as well as RNN, and LSTM deep-learning +language models, and release all our source code for public use. +Our evaluations suggest that carefully adapting N-gram models for +source code can yield performance that surpasses even RNN and +LSTM based deep-learning models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Learning Type Inference

+

V. J. Hellendoorn, Christian Bird, Earl T. Barr, Miltiadis Allamanis. FSE 2018

+

+ + + +
+ + representation + + types + +

+

Dynamically typed languages such as JavaScript and Python are +increasingly popular, yet static typing has not been totally eclipsed: +Python now supports type annotations and languages like TypeScript offer a middle-ground for JavaScript: a strict superset of +JavaScript, to which it transpiles, coupled with a type system that +permits partially typed programs. However, static typing has a cost: +adding annotations, reading the added syntax, and wrestling with +the type system to fix type errors. Type inference can ease the +transition to more statically typed code and unlock the benefits of +richer compile-time information, but is limited in languages like +JavaScript as it cannot soundly handle duck-typing or runtime evaluation +via eval. We propose DeepTyper, a deep learning model +that understands which types naturally occur in certain contexts +and relations and can provide type suggestions, which can often +be verified by the type checker, even if it could not infer the type +initially. DeepTyper, leverages an automatically aligned corpus +of tokens and types to accurately predict thousands of variable +and function type annotations. Furthermore, we demonstrate that +context is key in accurately assigning these types and introduce a +technique to reduce overfitting on local cues while highlighting the +need for further improvements. Finally, we show that our model +can interact with a compiler to provide more than 4,000 additional +type annotations with over 95% precision that could not be inferred +without the aid of DeepTyper.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Global Relational Models of Source Code

+

Vincent J. Hellendoorn, Charles Sutton, Rishab Singh, Petros Maniatis, David Bieber. ICLR 2020

+

+ + [OpenReview] + + + +
+ + variable misuse + + defect + + GNN + + Transformer + +

+

Models of code can learn distributed representations of a program’s syntax and semantics to predict many non-trivial properties of a program. Recent state-of-the-art models leverage highly structured representations of programs, such as trees, graphs and paths therein (e.g. data-flow relations), which are precise and abundantly available for code. This provides a strong inductive bias towards semantically meaningful relations, yielding more generalizable representations than classical sequence-based models. Unfortunately, these models primarily rely on graph-based message passing to represent relations in code, which makes them de facto local due to the high cost of message-passing steps, quite in contrast to modern, global sequence-based models, such as the Transformer. In this work, we bridge this divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers (GREAT for short), which bias traditional Transformers with relational information from graph edge types. By studying a popular, non-trivial program repair task, variable-misuse identification, we explore the relative merits of traditional and hybrid model families for code representation. Starting with a graph-based model that already improves upon the prior state-of-the-art for this task by 20%, we show that our proposed hybrid models improve an additional 10-15%, while training both faster and using fewer parameters.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Semantic Robustness of Models of Source Code

+

Jordan Henkel, Goutham Ramakrishnan, Zi Wang, Aws Albarghouthi, Somesh Jha, Thomas Reps. SANER 2022

+

+ + [PDF] + + [IEEE] + + [ArXiV] + + [Code] + + + +
+ + adversarial + + naming + +

+

Deep neural networks are vulnerable to adversarial examples - small input perturbations that result in incorrect predictions. We study this problem for models of source code, where we want the neural network to be robust to source-code modifications that preserve code functionality. To facilitate training robust models, we define a powerful and generic adversary that can employ sequences of parametric, semantics-preserving program transformations. We then explore how, with such an adversary, one can train models that are robust to adversarial program transformations. We conduct a thorough evaluation of our approach and find several surprising facts: we find robust training to beat dataset augmentation in every evaluation we performed; we find that a state-of-the-art architecture (code2seq) for models of code is harder to make robust than a simpler baseline; additionally, we find code2seq to have surprising weaknesses not present in our simpler baseline model; finally, we find that robust models perform better against unseen data from different sources (as one might hope) - however, we also find that robust models are not clearly better in the cross-language transfer task. To the best of our knowledge, we are the first to study the interplay between robustness of models of code and the domain-adaptation and cross-language transfer tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent

+

Geert Heyman, Tom Van Cutsem. 2020

+

+ + [ArXiV] + + + +
+ + search + +

+

In this work, we propose and study annotated code search: the retrieval of code snippets paired with brief descriptions of their intent using natural language queries. On three benchmark datasets, we investigate how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets. Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description. We find that our model yields significantly more relevant search results (with absolute gains up to 20.6% in mean reciprocal rank) compared to state-of-the-art code retrieval methods that do not use descriptions but attempt to compute the intent of snippets solely from unannotated code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On the Naturalness of Software

+

Abram Hindle, Earl T. Barr, Mark Gabel, Zhendong Su, Premkumar Devanbu. ICSE 2012

+

+ + + +
+ + language model + + autocomplete + +

+

Natural languages like English are rich, complex, +and powerful. The highly creative and graceful use of languages +like English and Tamil, by masters like Shakespeare and +Avvaiyar, can certainly delight and inspire. But in practice, +given cognitive constraints and the exigencies of daily life, most +human utterances are far simpler and much more repetitive +and predictable. In fact, these utterances can be very usefully +modeled using modern statistical methods. This fact has led +to the phenomenal success of statistical approaches to speech +recognition, natural language translation, question-answering, +and text mining and comprehension.

+ +

We begin with the conjecture that most software is also +natural, in the sense that it is created by humans at work, +with all the attendant constraints and limitations—and thus, +like natural language, it is also likely to be repetitive and +predictable. We then proceed to ask whether a) code can +be usefully modeled by statistical language models and b) +such models can be leveraged to support software engineers. +Using the widely adopted n-gram model, we provide empirical +evidence supportive of a positive answer to both these questions. +We show that code is also very repetitive, and in fact even more +so than natural languages. As an example use of the model, +we have developed a simple code completion engine for Java +that, despite its simplicity, already improves Eclipse’s built-in +completion capability. We conclude the paper by laying out a +vision for future research in this area.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CC2Vec: Distributed Representations of Code Changes

+

Thong Hoang, Hong Jin Kang, Julia Lawall, David Lo. ICSE 2020

+

+ + [ArXiV] + + [code] + + + +
+ + edit + +

+

Existing work on software patches often use features specific to a single task. These works often rely on manually identified features, and human effort is required to identify these features for each task. In this work, we propose CC2Vec, a neural network model that learns a representation of code changes guided by their accompanying log messages, which represent the semantic intent of the code changes. CC2Vec models the hierarchical structure of a code change with the help of the attention mechanism and uses multiple comparison functions to identify the differences between the removed and added code.

+ +

To evaluate if CC2Vec can produce a distributed representation of code changes that is general and useful for multiple tasks on software patches, we use the vectors produced by CC2Vec for three tasks: log message generation, bug fixing patch identification, and just-in-time defect prediction. In all tasks, the models using CC2Vec outperform the state-of-the-art techniques.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Fix-Filter-Fix: Intuitively Connect Any Models for Effective Bug Fixing

+

Haiwen Hong, Jingfeng Zhang, Yin Zhang, Yao Wan, Yulei Sui. EMNLP 2021

+

+ + [Proceedings] + + + +
+ + repair + +

+

Locating and fixing bugs is a time-consuming task. Most neural machine translation (NMT) based approaches for automatically bug fixing lack generality and do not make full use of the rich information in the source code. In NMT-based bug fixing, we find some predicted code identical to the input buggy code (called unchanged fix) in NMT-based approaches due to high similarity between buggy and fixed code (e.g., the difference may only appear in one particular line). Obviously, unchanged fix is not the correct fix because it is the same as the buggy code that needs to be fixed. Based on these, we propose an intuitive yet effective general framework (called Fix-Filter-Fix or Fˆ3) for bug fixing. Fˆ3 connects models with our filter mechanism to filter out the last model’s unchanged fix to the next. We propose an Fˆ3 theory that can quantitatively and accurately calculate the Fˆ3 lifting effect. To evaluate, we implement the Seq2Seq Transformer (ST) and the AST2Seq Transformer (AT) to form some basic Fˆ3 instances, called Fˆ3_ST+AT and Fˆ3_AT+ST. Comparing them with single model approaches and many model connection baselines across four datasets validates the effectiveness and generality of Fˆ3 and corroborates our findings and methodology.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Using Web Corpus Statistics for Program Analysis

+

Chun-Hung Hsiao, Michael Cafarella, Satish Narayanasamy. OOPSLA 2014

+

+ + + +
+ + defect + +

+

Several program analysis tools—such as plagiarism detection and bug finding—rely on knowing a piece of code’s +relative semantic importance. For example, a plagiarism detector should not bother reporting two programs that have +an identical simple loop counter test, but should report programs that share more distinctive code. Traditional program +analysis techniques (e.g., finding data and control dependencies) are useful, but do not say how surprising or common +a line of code is. Natural language processing researchers +have encountered a similar problem and addressed it using +an n-gram model of text frequency, derived from statistics +computed over text corpora.

+ +

We propose and compute an n-gram model for programming languages, computed over a corpus of 2.8 million +JavaScript programs we downloaded from the Web. In contrast to previous techniques, we describe a code n-gram as +a subgraph of the program dependence graph that contains +all nodes and edges reachable in n steps from the statement. +We can count n-grams in a program and count the frequency +of n-grams in the corpus, enabling us to compute tf-idf-style +measures that capture the differing importance of different +lines of code. We demonstrate the power of this approach by +implementing a plagiarism detector with accuracy that beats +previous techniques, and a bug-finding tool that discovered +over a dozen previously unknown bugs in a collection of real +deployed programs.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeSum: Translate Program Language to Natural Language

+

Xing Hu, Yuhan Wei, Ge Li, Zhi Jin. 2017

+

+ + [ArXiV] + + + +
+ + bimodal + + summarization + +

+

During software maintenance, programmers spend a lot of time on code comprehension. Reading comments is an effective way for programmers to reduce the reading and navigating time when comprehending source code. Therefore, as a critical task in software engineering, code summarization aims to generate brief natural language descriptions for source code. In this paper, we propose a new code summarization model named CodeSum. CodeSum exploits the attention-based sequence-to-sequence (Seq2Seq) neural network with Structure-based Traversal (SBT) of Abstract Syntax Trees (AST). The AST sequences generated by SBT can better present the structure of ASTs and keep unambiguous. We conduct experiments on three large-scale corpora in different program languages, i.e., Java, C#, and SQL, in which Java corpus is our new proposed industry code extracted from Github. Experimental results show that our method CodeSum outperforms the state-of-the-art significantly.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CoSQA: 20,000+ Web Queries for Code Search and Question Answering

+

Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, Nan Duan. ACL 2021

+

+ + [ArXiV] + + [Code] + + + +
+ + dataset + + search + +

+

Finding codes given natural language query is beneficial to the productivity of software developers. +Future progress towards better semantic matching between query and code requires richer supervised training resources. +To remedy this, we introduce the CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and codes, +each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance query-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

+

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, Marc Brockschmidt. 2019

+

+ + [ArXiV] + + [Code and other info] + + [Leaderboard] + + + +
+ + dataset + + search + +

+

Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas.

+ +

To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task.

+ +

We hope that CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further and will host a competition and leaderboard to track the progress on the challenge. We are also keen on extending CodeSearchNet Challenge to more queries and programming languages in the future.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Transfer Learning for Source Code Modeling

+

Yasir Hussain, Zhiqiu Huang, Yu Zhou, Senzhang Wang. 2019

+

+ + [ArXiV] + + + +
+ + pretraining + +

+

In recent years, deep learning models have shown great potential in source code modeling and analysis. Generally, deep learning-based approaches are problem-specific and data-hungry. A challenging issue of these approaches is that they require training from starch for a different related problem. In this work, we propose a transfer learning-based approach that significantly improves the performance of deep learning-based source code models. In contrast to traditional learning paradigms, transfer learning can transfer the knowledge learned in solving one problem into another related problem. First, we present two recurrent neural network-based models RNN and GRU for the purpose of transfer learning in the domain of source code modeling. Next, via transfer learning, these pre-trained (RNN and GRU) models are used as feature extractors. Then, these extracted features are combined into attention learner for different downstream tasks. The attention learner leverages from the learned knowledge of pre-trained models and fine-tunes them for a specific downstream task. We evaluate the performance of the proposed approach with extensive experiments with the source code suggestion task. The results indicate that the proposed approach outperforms the state-of-the-art models in terms of accuracy, precision, recall, and F-measure without training the models from scratch.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Summarizing Source Code using a Neural Attention Model

+

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer. ACL 2016

+

+ + + +
+ + summarization + + bimodal + +

+

High quality source code is often paired +with high level summaries of the computation it performs, for example in code +documentation or in descriptions posted +in online forums. Such summaries are +extremely useful for applications such as +code search but are expensive to manually +author, hence only done for a small fraction of all code that is produced. In this +paper, we present the first completely data-driven approach for generating high level +summaries of source code. Our model, +CODE-NN , uses Long Short Term Memory (LSTM) networks with attention to +produce sentences that describe C# code +snippets and SQL queries. CODE-NN +is trained on a new corpus that is automatically collected from StackOverflow, +which we release. Experiments demonstrate strong performance on two tasks: +(1) code summarization, where we establish the first end-to-end learning results +and outperform strong baselines, and (2) +code retrieval, where our learned model +improves the state of the art on a recently +introduced C# benchmark by a large margin.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Mapping Language to Code in Programmatic Context

+

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer. EMNLP 2018

+

+ + + +
+ + bimodal + + code generation + +

+

Source code is rarely written in isolation. It depends significantly on the programmatic context, such as the class that the code would reside in. To study this phenomenon, we introduce the task of generating class member functions given English documentation and the programmatic context provided by the rest of the class. This task is challenging because the desired code can vary greatly depending on the functionality the class provides (e.g., a sort function may or may not be available when we are asked to “return the smallest element” in a particular member variable list). We introduce CONCODE, a new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develop a new encoder-decoder architecture that models the interaction between the method documentation and the class environment. We also present a detailed error analysis suggesting that there is significant room for future work on this task.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Programmatic Idioms for Scalable Semantic Parsing

+

Srinivasan Iyer, Alvin Cheung, Luke Zettlemoyer. 2019

+

+ + + +
+ + pattern mining + + code generation + + grammar + +

+

Programmers typically organize executable source code using high-level coding patterns or idiomatic structures such as nested loops, exception handlers and recursive blocks, rather than as individual code tokens. In contrast, state of the art semantic parsers still map natural language instructions to source code by building the code syntax tree one node at a time. In this paper, we introduce an iterative method to extract code idioms from large source code corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax trees, and we train semantic parsers to apply these idioms during decoding. We apply this idiom-based code generation to a recent context-dependent semantic parsing task, and improve the state of the art by 2.2% BLEU score while reducing training time by more than 50%. This improved speed enables us to scale up the model by training on an extended training set that is 5x times larger, to further move up the state of the art by an additional 2.3% BLEU and 0.9% exact match.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Contrastive Code Representation Learning

+

Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica. 2020

+

+ + [ArXiV] + + [Website] + + [GitHub] + + + +
+ + representation + + pretraining + +

+

Machine-aided programming tools such as type predictors and code summarizers +are increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised +algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on +the raw text of programs. In particular, we design an unsupervised pretext task by +generating textually divergent copies of source functions via automated source-tosource compiler transforms that preserve semantics. We train a neural model to +identify variants of an anchor program within a large batch of negatives. To solve +this task, the network must extract program features representing the functionality, +not form, of the program. This is the first application of instance discrimination +to code representation learning to our knowledge. We pre-train models over 1.8m +unannotated JavaScript methods mined from GitHub. ContraCode pre-training +improves code summarization accuracy by 7.9% over supervised approaches and +4.8% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves +the accuracy of existing baselines.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

TreeCaps: Tree-Structured Capsule Networks for Program Source Code Processing

+

Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Luke Zettlemoyer. 2019

+

+ + [ArXiV] + + + +
+ + representation + +

+

Program comprehension is a fundamental task in software development and maintenance processes. Software developers often need to understand a large amount of existing code before they can develop new features or fix bugs in existing programs. Being able to process programming language code automatically and provide summaries of code functionality accurately can significantly help developers to reduce time spent in code navigation and understanding, and thus increase productivity. Different from natural language articles, source code in programming languages often follows rigid syntactical structures and there can exist dependencies among code elements that are located far away from each other through complex control flows and data flows. Existing studies on tree-based convolutional neural networks (TBCNN) and gated graph neural networks (GGNN) are not able to capture essential semantic dependencies among code elements accurately. In this paper, we propose novel tree-based capsule networks (TreeCaps) and relevant techniques for processing program code in an automated way that encodes code syntactical structures and captures code dependencies more accurately. Based on evaluation on programs written in different programming languages, we show that our TreeCaps-based approach can outperform other approaches in classifying the functionalities of many programs.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Type Annotation: Is Big Data Enough?

+

Kevin Jesse, Premkumar Devanbu, Toufique Ahmed. FSE 2021

+

+ + + +
+ + Transformer + + types + +

+

TypeScript is a widely used optionally-typed language where developers can adopt “pay as you go” typing: they can add types as +desired, and benefit from static typing. The “type annotation tax” +or manual effort required to annotate new or existing TypeScript +can be reduced by a variety of automatic methods. Probabilistic +machine-learning (ML) approaches work quite well. ML approaches +use different inductive biases, ranging from simple token sequences +to complex graphical neural network (GNN) models capturing syntax and semantic relations. More sophisticated inductive biases are +hand-engineered to exploit the formal nature of software. Rather +than deploying fancy inductive biases for code, can we just use “big +data” to learn natural patterns relevant to typing? We find evidence +suggesting that this is the case. We present TypeBert, demonstrating that even with simple token-sequence inductive bias used in +BERT-style models and enough data, type-annotation performance +of the most sophisticated models can be surpassed.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning To Predict User-Defined Types

+

Kevin Jesse, Premkumar T. Devanbu, Anand Sawant. TSE 2022

+

+ + + +
+ + Transformer + + types + +

+

TypeScript is a widely adopted gradual typed language where developers can optionally type variables, functions, parameters and more. Probabilistic type inference approaches with ML (machine learning) work well especially for commonly occurring types such as boolean, number, and string. TypeScript permits a wide range of types including developer defined class names and type interfaces. These developer defined types, termed user-defined types, can be written within the realm of language naming conventions. The set of user-defined types is boundless and existing bounded type guessing approaches are an imperfect solution. Existing works either under perform in user-defined types or ignore user-defined types altogether. This work leverages a BERT-style pre-trained model, with multi-task learning objectives, to learn how to type user-defined classes and interfaces. Thus we present DIVERSETYPER, a solution that explores the diverse set of user-defined types by uniquely aligning classes and interfaces declarations to the places in which they are used. DIVERSETYPER surpasses all existing works including those that model user-defined types.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Large Language Models and Simple, Stupid Bugs

+

Kevin Jesse, Toufique Ahmed, Premkumar T. Devanbu, Emily Morgan. 2023

+

+ + [ArXiV] + + + +
+ + Transformer + + defect + +

+

With the advent of powerful neural language models, AI-based systems to assist developers in coding tasks are becoming widely available; Copilot is one such system. Copilot uses Codex, a large language model (LLM), to complete code conditioned on a preceding “prompt”. Codex, however, is trained on public GitHub repositories, viz., on code that may include bugs and vulnerabilities. Previous studies [1], [2] show Codex reproduces vulnerabilities seen in training. In this study, we examine how prone Codex is to generate an interesting bug category, single statement bugs, commonly referred to as simple, stupid bugs or SStuBs in the MSR community. We find that Codex and similar LLMs do help avoid some SStuBs, but do produce known, verbatim SStuBs as much as 2x as likely than known, verbatim correct code. We explore the consequences of the Codex generated SStuBs and propose avoidance strategies that suggest the possibility of reducing the production of known, verbatim SStubs, and increase the possibility of producing known, verbatim fixes.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Multimodal Representation for Neural Code Search

+

Jian Gu, Zimin Chen, Martin Monperrus. ICSME 2021

+

+ + [ArXiV] + + [code] + + + +
+ + search + + representation + +

+

Semantic code search is about finding semantically relevant code snippets for a given natural language query. In the state-of-the-art approaches, the semantic similarity between code and query is quantified as the distance of their representation in the shared vector space. In this paper, to improve the vector space, we introduce tree-serialization methods on a simplified form of AST and build the multimodal representation for the code data. We conduct extensive experiments using a single corpus that is large-scale and multi-language: CodeSearchNet. Our results show that both our tree-serialized representations and multimodal learning model improve the performance of code search. Last, we define intuitive quantification metrics oriented to the completeness of semantic and syntactic information of the code data, to help understand the experimental findings.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Assemble Foundation Models for Automatic Code Summarization

+

Jian Gu, Pasquale Salza, Harald C. Gall. SANER 2022

+

+ + [ArXiV] + + [code] + + + +
+ + summarization + + documentation + + language model + +

+

Automatic code summarization is beneficial to software development and maintenance since it reduces the burden of manual tasks. Currently, artificial intelligence is undergoing a paradigm shift. The foundation models pretrained on massive data and finetuned to downstream tasks surpass specially customized models. This trend inspired us to consider reusing foundation models instead of learning from scratch. Based on this, we propose a flexible and robust approach for automatic code summarization based on neural networks. We assemble available foundation models, such as CodeBERT and GPT-2, into a single model named AdaMo. Moreover, we utilize Gaussian noise as the simulation of contextual information to optimize the latent representation. Furthermore, we introduce two adaptive schemes from the perspective of knowledge transfer, namely continuous pretraining and intermediate finetuning, and design intermediate stage tasks for general sequence-to-sequence learning. Finally, we evaluate AdaMo against a benchmark dataset for code summarization, by comparing it with state-of-the-art models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Automatically Generating Commit Messages from Diffs using Neural Machine Translation

+

Siyuan Jiang, Ameer Armaly, Collin McMillan. ASE 2017

+

+ + [ArXiV] + + + +
+ + edit + + bimodal + +

+

Commit messages are a valuable resource in comprehension of software evolution, since they provide a record of changes such as feature additions and bug repairs. Unfortunately, programmers often neglect to write good commit messages. Different techniques have been proposed to help programmers by automatically writing these messages. These techniques are effective at describing what changed, but are often verbose and lack context for understanding the rationale behind a change. In contrast, humans write messages that are short and summarize the high level rationale. In this paper, we adapt Neural Machine Translation (NMT) to automatically “translate” diffs into commit messages. We trained an NMT algorithm using a corpus of diffs and human-written commit messages from the top 1k Github projects. We designed a filter to help ensure that we only trained the algorithm on higher-quality commit messages. Our evaluation uncovered a pattern in which the messages we generate tend to be either very high or very low quality. Therefore, we created a quality-assurance filter to detect cases in which we are unable to produce good messages, and return a warning instead.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

TreeBERT: A Tree-Based Pre-Trained Model for Programming Language

+

Xue Jiang, Zhuoran Zheng, Chen Lyu, Liang Li, Lei Lyu. UAI 2021

+

+ + [ArXiV] + + + +
+ + grammar + + Transformer + +

+

Source code can be parsed into the abstract syntax tree (AST) based on defined syntax rules. However, in pre-training, little work has considered the incorporation of tree structure into the learning process. In this paper, we present TreeBERT, a tree-based pre-trained model for improving programming language-oriented generation tasks. To utilize tree structure, TreeBERT represents the AST corresponding to the code as a set of composition paths and introduces node position embedding. The model is trained by tree masked language modeling (TMLM) and node order prediction (NOP) with a hybrid objective. TMLM uses a novel masking strategy designed according to the tree’s characteristics to help the model understand the AST and infer the missing semantics of the AST. With NOP, TreeBERT extracts the syntactical structure by learning the order constraints of nodes in AST. We pre-trained TreeBERT on datasets covering multiple programming languages. On code summarization and code documentation tasks, TreeBERT outperforms other pre-trained models and state-of-the-art models designed for these tasks. Furthermore, TreeBERT performs well when transferred to the pre-trained unseen programming language.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Graph Structure With A Finite-State Automaton Layer

+

Daniel D. Johnson, Hugo Larochelle, Daniel Tarlow. 2020

+

+ + [ArXiV] + + + +
+ + GNN + + program analysis + +

+

Graph-based neural network models are producing strong results in a number of domains, in part because graphs provide flexibility to encode domain knowledge in the form of relational structure (edges) between nodes in the graph. In practice, edges are used both to represent intrinsic structure (e.g., abstract syntax trees of programs) and more abstract relations that aid reasoning for a downstream task (e.g., results of relevant program analyses). In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. Motivated by their power in program analyses, we consider relations defined by paths on the base graph accepted by a finite-state automaton. We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automata policies on a graph-based POMDP and then training these policies using implicit differentiation. The result is a differentiable Graph Finite-State Automaton (GFSA) layer that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs. Additionally, we combine the GFSA layer with a larger graph-based model trained end-to-end on the variable misuse program understanding task, and find that using the GFSA layer leads to better performance than using hand-engineered semantic edges or other baseline methods for adding learned edge types.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model

+

Tae Hwan Jung. NLP4Prog 2021

+

+ + [PDF] + + + +
+ + dataset + + language model + + Transformer + +

+

Commit message is a document that summarizes source code changes in natural language. A good commit message clearly shows the source code changes, so this enhances collaboration between developers. Therefore, our work is to develop a model that automatically writes the commit message. To this end, we release 345K datasets consisting of code modification and commit messages in six programming languages (Python, PHP, Go, Java, JavaScript, and Ruby). Similar to the neural machine translation (NMT) model, using our dataset, we feed the code modification to the encoder input and the commit message to the decoder input and measure the result of the generated commit message with BLEU-4. Also, we propose the following two training methods to improve the result of generating the commit message: (1) A method of preprocessing the input to feed the code modification to the encoder input. (2) A method that uses an initial weight suitable for the code domain to reduce the gap in contextual representation between programming language (PL) and natural language (NL).

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Automatic Acquisition of Annotated Training Corpora for Test-Code Generation

+

Magdalena Kacmajor, John D. Kelleher.. Information 2019

+

+ + + +
+ +

+

Open software repositories make large amounts of source code publicly available. Potentially, this source code could be used as training data to develop new, machine learning-based programming tools. For many applications, however, raw code scraped from online repositories does not constitute an adequate training dataset. Building on the recent and rapid improvements in machine translation (MT), one possibly very interesting application is code generation from natural language descriptions. One of the bottlenecks in developing these MT-inspired systems is the acquisition of parallel text-code corpora required for training code-generative models. This paper addresses the problem of automatically synthetizing parallel text-code corpora in the software testing domain. Our approach is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular for unit testing. Therefore, we propose synthesizing parallel corpora comprised of parsed test function names serving as code descriptions, aligned with the corresponding function bodies. We present the results of applying one of the state-of-the-art MT methods on such a generated dataset. Our experiments show that a neural MT model trained on our dataset can generate syntactically correct and semantically relevant short Java functions from quasi-natural language descriptions of functionality.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Pre-trained Contextual Embedding of Source Code

+

Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi. ICML 2020

+

+ + [ArXiV] + + + +
+ + pretraining + +

+

The source code of a program not only serves as a formal description of an executable task, but it also serves to communicate developer intent in a human-readable form. To facilitate this, developers use meaningful identifier names and natural-language documentation. This makes it possible to successfully apply sequence-modeling approaches, shown to be effective in natural-language processing, to source code. A major advancement in natural-language understanding has been the use of pre-trained token embeddings; BERT and other works have further shown that pre-trained contextual embeddings can be extremely powerful and can be fine-tuned effectively for a variety of downstream supervised tasks. Inspired by these developments, we present the first attempt to replicate this success on source code. We curate a massive corpus of Python programs from GitHub to pre-train a BERT model, which we call Code Understanding BERT (CuBERT). We also pre-train Word2Vec embeddings on the same dataset. We create a benchmark of five classification tasks and compare fine-tuned CuBERT against sequence models trained with and without the Word2Vec embeddings. Our results show that CuBERT outperforms the baseline methods by a margin of 2.9-22%. We also show its superiority when fine-tuned with smaller datasets, and over fewer epochs. We further evaluate CuBERT’s effectiveness on a joint classification, localization and repair task involving prediction of two pointers.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Phrase-Based Statistical Translation of Programming Languages

+

S. Karaivanov, Veselin Raychev, Martin Vechev. Onward 2014

+

+ + + +
+ + migration + + code generation + +

+

Phrase-based statistical machine translation approaches have been +highly successful in translating between natural languages and are +heavily used by commercial systems (e.g. Google Translate).

+ +

The main objective of this work is to investigate the applicability of +these approaches for translating between programming languages. +Towards that, we investigated several variants of the phrase-based +translation approach: i) a direct application of the approach to +programming languages, ii) a novel modification of the approach +to incorporate the grammatical structure of the target programming +language (so to avoid generating target programs which do not +parse), and iii) a combination of ii) with custom rules added to +improve the quality of the translation.

+ +

To experiment with the above systems, we investigated machine +translation from C# to Java. For the training, which takes about +60 hours, we used a parallel corpus of 20, 499 C#-to-Java method +translations. We then evaluated each of the three systems above by +translating 1,000 C# methods. Our experimental results indicate +that with the most advanced system, about 60% of the translated +methods compile (the top ranked) and out of a random sample of 50 +correctly compiled methods, 68% (34 methods) were semantically +equivalent to the reference solution.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

+

Rafael-Michael Karampatsis, Charles Sutton. 2019

+

+ + [ArXiV] + + [Code] + + + +
+ + language model + +

+

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. But the open vocabulary version of the neural network language models for code have not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best in class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

+

Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes Charles Sutton, Andrea Janes. ICSE 2020

+

+ + [Link] + + + +
+ + language model + +

+

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SCELMo: Source Code Embeddings from Language Models

+

Rafael-Michael Karampatsis, Charles Sutton. 2020

+

+ + [ArXiV] + + + +
+ + pretraining + + defect + +

+

Continuous embeddings of tokens in computer programs have been used to support a variety of software development tools, including readability, code search, and program repair. Contextual embeddings are common in natural language processing but have not been previously applied in software engineering. We introduce a new set of deep contextualized word representations for computer programs based on language models. We train a set of embeddings using the ELMo (embeddings from language models) framework of Peters et al (2018). We investigate whether these embeddings are effective when fine-tuned for the downstream task of bug detection. We show that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

What do pre-trained code models know about code?

+

Anjan Karmakar, Romain Robbes. 2021

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation, code summarization, among others. However, whether the vector representations from these pre-trained models comprehensively encode characteristics of source code well enough to be applicable to a broad spectrum of downstream tasks remains an open question.

+ +

One way to investigate this is with diagnostic tasks called probes. In this paper, we construct four probing tasks (probing for surface-level, syntactic, structural, and semantic information) for pre-trained code models. We show how probes can be used to identify whether models are deficient in (understanding) certain code properties, characterize different model layers, and get insight into the model sample-efficiency.

+ +

We probe four models that vary in their expected knowledge of code properties: BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code, and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow). While GraphCodeBERT performs more consistently overall, we find that BERT performs surprisingly well on some code tasks, which calls for further investigation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

JEMMA: An Extensible Java Dataset for ML4Code Applications

+

Anjan Karmakar, Miltiadis Allamanis, Romain Robbes. EMSE 2022

+

+ + [ArXiV] + + + +
+ + dataset + +

+

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code’s richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Visualizing and Understanding Recurrent Networks

+

Andrej Karpathy, Justin Johnson, Li Fei-Fei. 2015

+

+ + [ArXiV] + + + +
+ + language model + + code generation + +

+

Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful +applications in a wide range of machine learning problems that involve sequential +data. However, while LSTMs provide exceptional results in practice, the source +of their performance and their limitations remain rather poorly understood. Using character-level language models as an interpretable testbed, we aim to bridge +this gap by providing an analysis of their representations, predictions and error +types. In particular, our experiments reveal the existence of interpretable cells that +keep track of long-range dependencies such as line lengths, quotes and brackets. +Moreover, our comparative analysis with finite horizon n-gram models traces the +source of the LSTM improvements to long-range structural dependencies. Finally, +we provide analysis of the remaining errors and suggests areas for further study.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Towards Neural Decompilation

+

Omer Katz, Yuval Olshaker, Yoav Goldberg, Eran Yahav. 2019

+

+ + [ArXiV] + + + +
+ + decompilation + +

+

We address the problem of automatic decompilation, converting a program in low-level representation back to a higher-level human-readable programming language. The problem of decompilation is extremely important for security researchers. Finding vulnerabilities and understanding how malware operates is much easier when done over source code.

+ +

The importance of decompilation has motivated the construction of hand-crafted rule-based decompilers. Such decompilers have been designed by experts to detect specific control-flow structures and idioms in low-level code and lift them to source level. The cost of supporting additional languages or new language features in these models is very high.

+ +

We present a novel approach to decompilation based on neural machine translation. The main idea is to automatically learn a decompiler from a given compiler. Given a compiler from a source language S to a target language T , our approach automatically trains a decompiler that can translate (decompile) T back to S . We used our framework to decompile both LLVM IR and x86 assembly to C code with high success rates. Using our LLVM and x86 instantiations, we were able to successfully decompile over 97% and 88% of our benchmarks respectively.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

I Speak, You Verify: Toward Trustworthy Neural Program Synthesis

+

Darren Key, Wen-Ding Li, Kevin Ellis. 2022

+

+ + [ArXiV] + + + +
+ + synthesis + +

+

We develop an approach for improving the trustworthiness and overall accuracy of program synthesizers based on large language models for source code. Given a natural language description of a programming problem, our method samples both candidate programs as well as candidate predicates specifying how the program should behave. We learn to analyze the agreement between programs and predicates to judge both which program is most likely to be correct, and also judge whether the language model is able to solve the programming problem in the first place. This latter capacity allows favoring high precision over broad recall: fostering trust by only proposing a program when the system is certain that it is correct.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Reduce False Positives in Analytic Bug Detectors

+

Anant Kharkar, Roshanak Zilouchian Moghaddam, Matthew Jin, Xiaoyu Liu, Xin Shi, Colin Clement, Neel Sundaresan. ICSE 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + static analysis + +

+

Due to increasingly complex software design and rapid iterative development, code defects and security vulnerabilities are prevalent in modern software. In response, programmers rely on static analysis tools to regularly scan their codebases and find potential bugs. In order to maximize coverage, however, these tools generally tend to report a significant number of false positives, requiring developers to manually verify each warning. To address this problem, we propose a Transformer-based learning approach to identify false positive bug warnings. We demonstrate that our models can improve the precision of static analysis by 17.5%. In addition, we validated the generalizability of this approach across two major bug types: null dereference and resource leak.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code Prediction by Feeding Trees to Transformers

+

Seohyun Kim, Jinman Zhao, Yuchi Tian, Satish Chandra. 2020

+

+ + [ArXiV] + + [Code] + + + +
+ + autocomplete + +

+

In this paper, we describe how to leverage Transformer, a recent neural architecture for learning from sequential data (such as text), for code completion. As in the realm of natural language processing, Transformers surpass the prediction accuracy achievable by RNNs; we provide an experimental confirmation of this over a Python dataset.

+ +

Furthermore, we show that the way to obtain even better accuracy from Transformers is to expose the syntactic structure of code, which is easily recovered by parsing, to the neural network. This works significantly better than presenting the code as a linear token sequence, which is how Transformers were originally intended to be used.

+ +

To accomplish this, we propose a novel enhancement to the self-attention mechanism of the Transformer. We enable the mechanism to learn weights—that is, how much to focus on each preceding token in the input—not only on the basis of a token’s value, but also on the basis of the spatial relationships, as in their positions in the abstract syntax tree, between each pair of tokens.

+ +

We provide comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a Python corpus internal to Facebook.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning a Classifier for False Positive Error Reports Emitted by Static Code Analysis Tools

+

Ugur Koc, Parsa Saadatpanah, Jeffrey S. Foster, Adam A. Porter.. MAPL 2017

+

+ + + +
+ + static analysis + +

+

The large scale and high complexity of modern software systems +make perfectly precise static code analysis (SCA) infeasible. Therefore SCA tools often over-approximate, so not to miss any real +problems. This, however, comes at the expense of raising false +alarms, which, in practice, reduces the usability of these tools.

+ +

To partially address this problem, we propose a novel learning +process whose goal is to discover program structures that cause +a given SCA tool to emit false error reports, and then to use this +information to predict whether a new error report is likely to be a +false positive as well. To do this, we first preprocess code to isolate +the locations that are related to the error report. Then, we apply +machine learning techniques to the preprocessed code to discover +correlations and to learn a classifier.

+ +

We evaluated this approach in an initial case study of a widely-used SCA tool for Java. Our results showed that for our dataset +we could accurately classify a large majority of false positive error +reports. Moreover, we identified some common coding patterns that +led to false positive errors. We believe that SCA developers may be +able to redesign their methods to address these patterns and reduce +false positive error reports.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

The Stack: 3TB of permissively licensed source code

+

Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries. 2022

+

+ + [Preprint] + + + +
+ + dataset + +

+

Large Language Models (LLMs) play an ever-increasing role in the field of +Artificial Intelligence (AI)–not only for natural language processing but also +for code understanding and generation. To stimulate open and responsible +research on LLMs for code, we introduce The Stack, a 3.1 TB dataset +consisting of permissively licensed source code in 30 programming languages. +We describe how we collect the full dataset, construct a permissively licensed +subset, and present promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that +(1) near-deduplicating the data significantly boosts performance across all +experiments, and (2) it is possible to match previously reported HumanEval +and MBPP performance using only permissively licensed data. We make the +dataset available at https://hf.co/BigCode and give developers the possi- +bility to have their code removed from the dataset by following the instruc- +tions at https://www.bigcode-project.org/docs/about/the-stack/.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Energy-Based Models for Code Generation under Compilability Constraints

+

Tomasz Korbak, Hady Elsahar, Marc Dymetman, Germán Kruszewski. ACL 2021

+

+ + [ArXiV] + + + +
+ + code generation + +

+

Neural language models can be successfully trained on source code, leading to applications such as code completion. However, their versatile autoregressive self-supervision objective overlooks important global sequence-level features that are present in the data such as syntactic correctness or compilability. In this work, we pose the problem of learning to generate compilable code as constraint satisfaction. We define an Energy-Based Model (EBM) representing a pre-trained generative model with an imposed constraint of generating only compilable sequences. We then use the KL-Adaptive Distributional Policy Gradient algorithm (Khalifa et al., 2021) to train a generative model approximating the EBM. We conduct experiments showing that our proposed approach is able to improve compilability rates without sacrificing diversity and complexity of the generated samples.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Human perceiving behavior modeling in evaluation of code generation models

+

S. Kovalchuk, V. Lomshakov, A. Aliev. GEM 2022

+

+ + [ACLAnthology] + + + +
+ + code generation + + evaluation + + human evaluation + +

+

Within this study, we evaluated a series of code generation models based on CodeGen and GPTNeo to compare the metric-based performance and human evaluation. For a deeper analysis of human perceiving within the evaluation procedure we’ve implemented a 5-level Likert scale assessment of the model output using a perceiving model based on the Theory of Planned Behavior (TPB). Through such analysis, we showed an extension of model assessment as well as a deeper understanding of the quality and applicability of generated code for practical question answering. The approach was evaluated with several model settings in order to assess diversity in quality and style of answer. With the TPB-based model, we showed a different level of perceiving the model result, namely personal understanding, agreement level, and readiness to use the particular code. With such analysis, we investigate a series of issues in code generation as natural language generation (NLG) problems observed in a practical context of programming question-answering with code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Test-based and metric-based evaluation of code generation models for practical question answering

+

S. Kovalchuk, D. Fedrushkov, V. Lomshakov, A. Aliev. ICCQ 2023

+

+ + [IEEE] + + + +
+ + code generation + + test generation + + natural language generation + + evaluation + + metrics + + natural language processing + +

+

We performed a comparative analysis of code generation model performance with evaluation using common NLP metrics in comparison to a test-based evaluation. The investigation was performed in the context of question answering with code (test-to-code problem) and was aimed at applicability checking both ways for generated code evaluation in a fully automatic manner. We used CodeGen and GPTNeo pretrained models applied to a problem of question answering using Stack Overflow-based corpus (APIzation). For test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used for providing automatically generated tests. The analysis showed that the performance evaluation based solely on NLP metrics or on tests provides a rather limited assessment of generated code quality. We see the evidence that predictions with both high and low NLP metrics exist that pass and don’t pass tests. With the early results of our empirical study being discussed in this paper, we believe that the combination of both approaches may increase possible ways for building, evaluating, and training code generation models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

PathMiner : A Library for Mining of Path-Based Representations of Code

+

Vladimir Kovalenko, Egor Bogomolov, Timofey Bryksin, Alberto Bacchelli.. MSR 2019

+

+ + [Zenodo] + + + +
+ + representation + + grammar + +

+

One recent, significant advance in modeling source code for machine learning algorithms has been the introduction of path-based representation – an approach consisting in representing a snippet of code as a collection of paths from its syntax tree. Such representation efficiently captures the structure of code, which, in turn, carries its semantics and other information. +Building the path-based representation involves parsing the code and extracting the paths from its syntax tree; these steps build up to a substantial technical job. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from the essential work and hinders newcomers in the field of machine learning on code.

+ +

In this paper, we present PathMiner – an open-source library for mining path-based representations of code. PathMiner is fast, flexible, well-tested, and easily extensible to support input code in any common programming language. Preprint [https://doi.org/10.5281/zenodo.2595271]; released tool [https://doi.org/10.5281/zenodo.2595257].

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Factor Graph Model for Software Bug Finding

+

Ted Kremenek, Andrew Y. Ng, Dawson R. Engler.. IJCAI 2007

+

+ + + +
+ + program analysis + +

+

Automatic tools for finding software errors require +knowledge of the rules a program must obey, or +“specifications,” before they can identify bugs. We +present a method that combines factor graphs and +static program analysis to automatically infer specifications directly from programs. We illustrate the +approach on inferring functions in C programs that +allocate and release resources, and evaluate the approach on three codebases: SDL, OpenSSH, and +the OS kernel for Mac OS X (XNU). The inferred +specifications are highly accurate and with them we +have discovered numerous bugs.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SPoC: Search-based Pseudocode to Code

+

Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, Percy S. Liang. 2019

+

+ + + +
+ + bimodal + + synthesis + +

+

We consider the task of mapping pseudocode to long programs that are functionally correct. Given test cases as a mechanism to validate programs, we search over the space of possible translations of the pseudocode to find a program that passes the validation. However, without proper credit assignment to localize the sources of program failures, it is difficult to guide search toward more promising programs. We propose to perform credit assignment based on signals from compilation errors, which constitute 88.7% of program failures. Concretely, we treat the translation of each pseudocode line as a discrete portion of the program, and whenever a synthesized program fails to compile, an error localization method tries to identify the portion of the program responsible for the failure. We then focus search over alternative translations of the pseudocode for those portions. For evaluation, we collected the SPoC dataset (Search-based Pseudocode to Code) containing 18,356 programs with human-authored pseudocode and test cases. Under a budget of 100 program compilations, performing search improves the synthesis success rate over using the top-one translation of the pseudocode from 25.6% to 44.7%.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Recommendation of Move Method Refactoring Using Path-Based Representation of Code

+

Zarina Kurbatova, Ivan Veselov, Yaroslav Golubev, Timofey Bryksin. 2020

+

+ + [ArXiV] + + + +
+ + refactoring + +

+

Software refactoring plays an important role in increasing code quality. One of the most popular refactoring types is the Move Method refactoring. It is usually applied when a method depends more on members of other classes than on its own original class. Several approaches have been proposed to recommend Move Method refactoring automatically. Most of them are based on heuristics and have certain limitations (e.g., they depend on the selection of metrics and manually-defined thresholds). In this paper, we propose an approach to recommend Move Method refactoring based on a path-based representation of code called code2vec that is able to capture the syntactic structure and semantic information of a code fragment. We use this code representation to train a machine learning classifier suggesting to move methods to more appropriate classes. We evaluate the approach on two publicly available datasets: a manually compiled dataset of well-known open-source projects and a synthetic dataset with automatically injected code smell instances. The results show that our approach is capable of recommending accurate refactoring opportunities and outperforms JDeodorant and JMove, which are state of the art tools in this field.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Using Semantic Unification to Generate Regular Expressions from Natural Language

+

Nate Kushman, Regina Barzilay. NAACL 2013

+

+ + + +
+ + bimodal + + code generation + +

+

We consider the problem of translating natural language text queries into regular expressions which represent their meaning. The mismatch in the level of abstraction between the natural language representation and the regular expression representation make this a novel and challenging problem. However, a given regular expression can be written in many semantically equivalent forms, and we exploit this flexibility to facilitate translation by finding a form which more directly corresponds to the natural language. We evaluate our technique on a set of natural language queries and their associated regular expressions which we gathered from Amazon Mechanical Turk. Our model substantially outperforms a state-of-the-art semantic parsing baseline, yielding a 29% absolute improvement in accuracy.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Unsupervised Translation of Programming Languages

+

Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, Guillaume Lample. 2020

+

+ + [ArXiV] + + [GitHub] + + + +
+ + migration + +

+

A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is timeconsuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Neural Approach to Decompiled Identifier Renaming

+

Jeremy Lacomis, Pengcheng Yin, Edward J. Schwartz, Miltiadis Allamanis, Claire Le Goues, Graham Neubig, Bogdan Vasilescu. ASE 2019

+

+ + [ArXiV] + + [Code and Data] + + + +
+ + deobfuscation + + naming + + compilation + +

+

The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. However, compilation loses information contained within the original source code (e.g. structure, type information, and variable names). Semantically meaningful variable names are known to increase code understandability, but they generally cannot be recovered by decompilers. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3% of the time.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Exploring the Naturalness of Buggy Code with Recurrent Neural Network

+

Jack Lanchantin, Ji Gao. 2018

+

+ + [ArXiV] + + + +
+ + language model + + defect + +

+

Statistical language models are powerful tools +which have been used for many tasks within natural language processing. Recently, they have been +used for other sequential data such as source code. +(Ray et al., 2015) showed that it is possible train an +n-gram +source code language mode, and use it to +predict buggy lines in code by determining “unnatural” lines via entropy with respect to the language +model. In this work, we propose using a more advanced language modeling technique, Long Short-term Memory recurrent neural networks, to model +source code and classify buggy lines based on entropy. We show that our method slightly outperforms an +n-gram model in the buggy line classification task using AUC

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Neural Model for Generating Natural Language Summaries of Program Subroutines

+

Alexander LeClair, Siyuan Jiang, Collin McMillan. ICSE 2019

+

+ + [ArXiV] + + [Code and Data] + + + +
+ + summarization + + documentation + +

+

Source code summarization – creating natural language descriptions of source code behavior – is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance. Traditional techniques relied on heuristics and templates built manually by human experts. Recently, data-driven approaches based on neural machine translation have largely overtaken template-based systems. But nearly all of these techniques rely almost entirely on programs having good internal documentation; without clear identifier names, the models fail to create good summaries. In this paper, we present a neural model that combines words from code with code structure from an AST. Unlike previous approaches, our model processes each data source as a separate input, which allows the model to learn code structure independent of the text in code. This process helps our approach provide coherent summaries in many cases even when zero internal documentation is provided. We evaluate our technique with a dataset we created from 2.1m Java methods. We find improvement over two baseline techniques from SE literature and one from NLP literature.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Recommendations for Datasets for Source Code Summarization

+

Alexander LeClair, Collin McMillan. NAACL 2019 2019

+

+ + + +
+ + summarization + + dataset + +

+

Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable datasets. In addition, a lack of community standards for creating datasets leads to confusing and unreproducible research results – we observe swings in performance of more than 33% due only to changes in dataset design. In this paper, we make recommendations for these standards from experimental results. We release a dataset based on prior work of over 2.1m pairs of Java methods and one sentence method descriptions from over 28k Java projects. We describe the dataset and point out key differences from natural language data, to guide and support future researchers.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Improved Code Summarization via a Graph Neural Network

+

Alexander LeClair, Sakib Haque, Lingfei Wu, Collin McMillan. 2020

+

+ + [ArXiV] + + + +
+ + summarization + +

+

Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source code as input and outputs a natural language description. Yet a strong consensus is developing that using structural information as input leads to improved performance. The first approaches to use structural information flattened the AST into a sequence. Recently, more complex approaches based on random AST paths or graph neural networks have improved on the models using flattened ASTs. However, the literature still does not describe the using a graph neural network together with source code sequence as separate inputs to a model. Therefore, in this paper, we present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries. We evaluate our technique using a data set of 2.1 million Java method-comment pairs and show improvement over four baseline techniques, two from the software engineering literature, and two from machine learning literature.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer

+

Suyoung Lee, HyungSeok Han, Sang Kil Cha, Sooel Son. USENIX 2020

+

+ + [ArXiV] + + + +
+ + fuzzing + + language model + +

+

JavaScript (JS) engine vulnerabilities pose significant security threats affecting billions of web browsers. While fuzzing is a prevalent technique for finding such vulnerabilities, there have been few studies that leverage the recent advances in neural network language models (NNLMs). In this paper, we present Montage, the first NNLM-guided fuzzer for finding JS engine vulnerabilities. The key aspect of our technique is to transform a JS abstract syntax tree (AST) into a sequence of AST subtrees that can directly train prevailing NNLMs. We demonstrate that Montage is capable of generating valid JS tests, and show that it outperforms previous studies in terms of finding vulnerabilities. Montage found 37 real-world bugs, including three CVEs, in the latest JS engines, demonstrating its efficacy in finding JS engine bugs.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Co-Training for Commit Classification

+

Jian Yi, David Lee, Hai Leong Chieu. EMNLP WNUT 2021

+

+ + [website] + + [code] + + + +
+ + Transformer + + bimodal + + defect + +

+

Commits in version control systems (e.g. Git) track changes in a software project. Commits comprise noisy user-generated natural language and code patches. Automatic commit classification (CC) has been used to determine the type of code maintenance activities performed, as well as to detect bug fixes in code repositories. Much prior work occurs in the fully-supervised setting – a setting that can be a stretch in resource-scarce situations presenting difficulties in labeling commits. In this paper, we apply co-training, a semi-supervised learning method, to take advantage of the two views available – the commit message (natural language) and the code changes (programming language) – to improve commit classification.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Align the Source Code to the Compiled Object Code

+

Dor Levy, Lior Wolf. ICML 2017

+

+ + + +
+ + decompilation + +

+

We propose a new neural network architecture +and use it for the task of statement-by-statement +alignment of source code and its compiled object code. Our architecture learns the alignment +between the two sequences – one being the translation of the other – by mapping each statement +to a context-dependent representation vector and +aligning such vectors using a grid of the two sequence domains. Our experiments include short +C functions, both artificial and human-written, +and show that our neural network architecture +is able to predict the alignment with high accuracy, outperforming known baselines. We also +demonstrate that our model is general and can +learn to solve graph problems such as the Traveling Salesman Problem.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Topical: Learning Repository Embeddings from Source Code using Attention

+

Agathe Lherondelle, Yash Satsangi, Fran Silavong, Shaltiel Eloul, Sean Moran. Arxiv 2022

+

+ + [ArXiV] + + + +
+ + representation + + topic modelling + +

+

Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationship between software artefacts, MLOnCode +augments the software developer’s capabilities with code autogeneration, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script level +representation of code is sufficient, however, in many cases a repository level representation that takes into account various dependencies and repository structure is imperative, for example, +auto-tagging repositories with topics or auto-documentation of repository code etc. Existing methods for computing repository level representations suffer from (a) reliance on natural language +documentation of code (for example, README files) (b) naive aggregation of method/script-level representation, for example, by concatenation or averaging. This paper introduces Topical a +deep neural network to generate repository level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph and the +script level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that +were crawled along with their ground truth topic tags. Our experiments show that the embeddings computed by Topical are able to outperform multiple baselines, including baselines +that naively combine the method-level representations through averaging or concatenation at the task of repository auto-tagging. Furthermore, we show that Topical’s attention mechanism outperforms naive aggregation methods when computing repositorylevel representations from script-level representation generated +by existing methods. Topical is a lightweight framework for computing repository-level representation of code repositories that scales efficiently with the number of topics and dataset size.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Gated Graph Sequence Neural Networks

+

Yujia Li, Daniel Tarlow, Marc Brockschmidt, Richard Zemel. ICLR 2016

+

+ + [ArXiV] + + + +
+ + GNN + + program analysis + +

+

Graph-structured data appears frequently in domains including chemistry, natural +language semantics, social networks, and knowledge bases. In this work, we study +feature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify +to use gated recurrent units and modern optimization techniques and then extend +to output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based +models (e.g., LSTMs) when the problem is graph-structured. We demonstrate the +capabilities on some simple AI (bAbI) and graph algorithm learning tasks. We +then show it achieves state-of-the-art performance on a problem from program +verification, in which subgraphs need to be described as abstract data structures.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code Completion with Neural Attention and Pointer Networks

+

Jian Li, Yue Wang, Michael R. Lyu, Irwin King. 2017

+

+ + [ArXiV] + + + +
+ + language model + + autocomplete + +

+

Intelligent code completion has become an essential tool to accelerate modern software development. To facilitate effective code completion for dynamically-typed programming languages, we apply neural language models by learning from large codebases, and investigate the effectiveness of attention mechanism on the code completion task. However, standard neural language models even with attention mechanism cannot correctly predict out-of-vocabulary (OoV) words thus restrict the code completion performance. In this paper, inspired by the prevalence of locally repeated terms in program source code, and the recently proposed pointer networks which can reproduce words from local context, we propose a pointer mixture network for better predicting OoV words in code completion. Based on the context, the pointer mixture network learns to either generate a within-vocabulary word through an RNN component, or copy an OoV word from local context through a pointer component. Experiments on two benchmarked datasets demonstrate the effectiveness of our attention mechanism and pointer mixture network on the code completion task.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Software Defect Prediction via Convolutional Neural Network

+

Jian Li, Pinjia He, Jieming Zhu, Michael R. Lyu. QRS 2017

+

+ + + +
+ + defect + +

+

To improve software reliability, software defect prediction is utilized to assist developers in finding potential bugs +and allocating their testing efforts. Traditional defect prediction +studies mainly focus on designing hand-crafted features, which +are input into machine learning classifiers to identify defective +code. However, these hand-crafted features often fail to capture +the semantic and structural information of programs. Such +information is important in modeling program functionality and +can lead to more accurate defect prediction. +In this paper, we propose a framework called Defect Prediction +via Convolutional Neural Network (DP-CNN), which leverages +deep learning for effective feature generation. Specifically, based +on the programs’ Abstract Syntax Trees (ASTs), we first extract +token vectors, which are then encoded as numerical vectors +via mapping and word embedding. We feed the numerical +vectors into Convolutional Neural Network to automatically +learn semantic and structural features of programs. After that, +we combine the learned features with traditional hand-crafted +features, for accurate software defect prediction. We evaluate our +method on seven open source projects in terms of F-measure in +defect prediction. The experimental results show that in average, +DP-CNN improves the state-of-the-art method by 12%.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Improving Bug Detection via Context-Based Code Representation Learning and Attention-Based Neural Networks

+

Yi Li, Shaohua Wang, Tien N. Nguyen, Son Van Nguyen. OOPSLA 2019

+

+ + + +
+ + representation + + defect + +

+

Bug detection has been shown to be an effective way to help developers in detecting bugs early, thus, saving much effort and time in software development process. Recently, deep learning-based bug detection approaches have gained successes over the traditional machine learning-based approaches, the rule-based program analysis approaches, and mining-based approaches. However, they are still limited in detecting bugs that involve multiple methods and suffer high rate of false positives. In this paper, we propose a combination approach with the use of contexts and attention neural network to overcome those limitations. We propose to use as the global context the Program Dependence Graph (PDG) and Data Flow Graph (DFG) to connect the method under investigation with the other relevant methods that might contribute to the buggy code. The global context is complemented by the local context extracted from the path on the AST built from the method’s body. The use of PDG and DFG enables our model to reduce the false positive rate, while to complement for the potential reduction in recall, we make use of the attention neural network mechanism to put more weights on the buggy paths in the source code. That is, the paths that are similar to the buggy paths will be ranked higher, thus, improving the recall of our model. We have conducted several experiments to evaluate our approach on a very large dataset with +4.973M methods in 92 different project versions. The results show that our tool can have a relative improvement up to 160% on F-score when comparing with the state-of-the-art bug detection approaches. Our tool can detect 48 true bugs in the list of top 100 reported bugs, which is 24 more true bugs when comparing with the baseline approaches. We also reported that our representation is better suitable for bug detection and relatively improves over the other representations up to 206% in accuracy.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Code Search Evaluation Dataset

+

Hongyu Li, Seohyun Kim, Satish Chandra. 2019

+

+ + [ArXiV] + + [Dataset] + + + +
+ + dataset + + search + +

+

There has been an increase of interest in code search using natural language. Assessing the performance of such code search models can be difficult without a readily available evaluation suite. In this paper, we present an evaluation dataset consisting of natural language query and code snippet pairs, with the hope that future work in this area can use this dataset as a common benchmark. We also provide the results of two code search models ([1] and [6]) from recent work.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Using GGNN to recommend log statement level

+

Mingzhe Li, Jianrui Pei, Jin He, Kevin Song, Frank Che, Yongfeng Huang, Chitai Wang. 2019

+

+ + [ArXiV] + + + +
+ + GNN + + logging + +

+

In software engineering, log statement is an important part because programmers can’t access to users’ program and they can only rely on log message to find the root of bugs. The mechanism of “log level” allows developers and users to specify the appropriate amount of logs to print during the execution of the software. And 26\% of the log statement modification is to modify the level. We tried to use ML method to predict the suitable level of log statement. The specific model is GGNN(gated graph neural network) and we have drawn lessons from Microsoft’s research. In this work, we apply Graph Neural Networks to predict the usage of log statement level of some open source java projects from github. Given the good performance of GGNN in this task, we are confident that GGNN is an excellent choice for processing source code. We envision this model can play an important role in applying AI/ML technique for Software Development Life Cycle more broadly.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DLFix: Context-based Code Transformation Learning for Automated Program Repair

+

Yi Li, Shaohua Wang, Tien N. Nguyen. ICSE 2020

+

+ + + +
+ + edit + + repair + + grammar + +

+

Automated Program Repair (APR) is very useful in helping developers in the process of software development and maintenance. Despite recent advances in deep learning (DL), the DL-based APR approaches still have limitations in learning bug-fixing code changes and the context of the surrounding source code of the bug-fixing code changes. These limitations lead to incorrect fixing locations or fixes. In this paper, we introduce DLFix, a two-tier DL model that treats APR as code transformation learning from the prior bug fixes and the surrounding code contexts of the fixes. The first layer is a tree-based RNN model that learns the contexts of bug fixes and its result is used as an additional weighting input for the second layer designed to learn the bug-fixing code transformations.

+ +

We conducted several experiments to evaluate DLFix in two benchmarks: Defect4J and Bugs.jar, and a newly built bug datasets with a total of +20K real-world bugs in eight projects. We compared DLFix against a total of 13 state-of-the-art pattern-based APR tools. Our results show that DLFix can auto-fix more bugs than 11 of them, and is comparable and complementary to the top two pattern-based APR tools in which there are 7 and 11 unique bugs that they cannot detect, respectively, but we can. Importantly, DLFix is fully automated and data-driven, and does not require hard-coding of bug-fixing patterns as in those tools. We compared DLFix against 4 state-of-the-art deep learning based APR models. DLFix is able to fix 2.5 times more bugs than the best performing~baseline.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Code-Query Interaction for Enhancing Code Searches

+

Wei Li, Haozhe Qin, Shuhan Yan, Beijun Shen, Yuting Chen. ICSME 2020

+

+ + [IEEE] + + + +
+ + search + +

+

Code search plays an important role in software development and maintenance. In recent years, deep learning (DL) has achieved a great success in this domain-several DL-based code search methods, such as DeepCS and UNIF, have been proposed for exploring deep, semantic correlations between code and queries; each method usually embeds source code and natural language queries into real vectors followed by computing their vector distances representing their semantic correlations. Meanwhile, deep learning-based code search still suffers from three main problems, i.e., the OOV (Out of Vocabulary) problem, the independent similarity matching problem, and the small training dataset problem. To tackle the above problems, we propose CQIL, a novel, deep learning-based code search method. CQIL learns code-query interactions and uses a CNN (Convolutional Neural Network) to compute semantic correlations between queries and code snippets. In particular, CQIL employs a hybrid representation to model code-query correlations, which solves the OOV problem. CQIL also deeply learns the code-query interaction for enhancing code searches, which solves the independent similarity matching and the small training dataset problems. We evaluate CQIL on two datasets (CODEnn and CosBench). The evaluation results show the strengths of CQIL-it achieves the MAP@1 values, 0.694 and 0.574, on CODEnn and CosBench, respectively. In particular, it outperforms DeepCS and UNIF, two state-of-the-art code search methods, by 13.6% and 18.1% in MRR, respectively, when the training dataset is insufficient.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Extend Program Graphs to Work-in-Progress Code

+

Xuechen Li, Chris J. Maddison, Daniel Tarlow. 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + autocomplete + + repair + +

+

Source code spends most of its time in a broken or incomplete state during software development. This presents a challenge to machine learning for code, since high-performing models typically rely on graph structured representations of programs derived from traditional program analyses. Such analyses may be undefined for broken or incomplete code. We extend the notion of program graphs to work-in-progress code by learning to predict edge relations between tokens, training on well-formed code before transferring to work-in-progress code. We consider the tasks of code completion and localizing and repairing variable misuse in a work-in-process scenario. We demonstrate that training relation-aware models with fine-tuned edges consistently leads to improved performance on both tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Toward Less Hidden Cost of Code Completion with Acceptance and Ranking Models

+

Jingxuan Li, Rui Huang, Wei Li, Kai Yao, Weiguo Tan. ICSME 2021

+

+ + [ArXiV] + + + +
+ + autocomplete + + language model + + optimization + + Transformer + +

+

Code completion is widely used by software developers to provide coding suggestions given a partially written code snippet. Apart from the traditional code completion methods, which only support single token completion at minimal positions, recent studies show the ability to provide longer code completion at more flexible positions. However, such frequently triggered and longer completion results reduce the overall precision as they generate more invalid results. Moreover, different studies are mostly incompatible with each other. Thus, it is vital to develop an ensemble framework that can combine results from multiple models to draw merits and offset defects of each model. +This paper conducts a coding simulation to collect data from code context and different code completion models and then apply the data in two tasks. First, we introduce an acceptance model which can dynamically control whether to display completion results to the developer. It uses simulation features to predict whether correct results exist in the output of these models. Our best model reduces the percentage of false-positive completion from 55.09% to 17.44%. Second, we design a fusion ranking scheme that can automatically identify the priority of the completion results and reorder the candidates from multiple code completion models. This scheme is flexible in dealing with various models, regardless of the type or the length of their completion results. We integrate this ranking scheme with two frequency models and a GPT-2 styled language model, along with the acceptance model to yield 27.80% and 37.64% increase in TOP1 and TOP5 accuracy, respectively. In addition, we propose a new code completion evaluation metric, Benefit-Cost Ratio(BCR), taking into account the benefit of keystrokes saving and hidden cost of completion list browsing, which is closer to real coder experience scenario.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeReviewer: Pre-Training for Automating Code Review Activities

+

Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan. 2022

+

+ + [ArXiV] + + + +
+ + review + +

+

Code review is an essential part to software development lifecycle since it aims at guaranteeing the quality of codes. Modern code review activities necessitate developers viewing, understanding and even running the programs to assess logic, functionality, latency, style and other factors. It turns out that developers have to spend far too much time reviewing the code of their peers. Accordingly, it is in significant demand to automate the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a large-scale dataset of real world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review senario. To evaluate our model, we focus on three key tasks related to code review activities, including code change quality estimation, review comment generation and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms the previous state-of-the-art pre-training approaches in all tasks. Further analysis show that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model on the understanding of code changes and reviews.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Exploring Representation-Level Augmentation for Code Search

+

Haochen Li, Chunyan Miao, Cyril Leung, Yanxian Huang, Yuan Huang, Hongyu Zhang, Yanlin Wang. EMNLP 2022

+

+ + [ArXiV] + + [code] + + + +
+ + search + + Transformer + +

+

Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning is widely used in code search research, where many data augmentation approaches for source code (e.g., semantic-preserving program transformation) are proposed to learn better representations. However, these augmentations are at the raw-data level, which requires additional code analysis in the preprocessing stage and additional training costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at representation level which does not require additional data processing and training, and based on this we propose a general format of representation-level augmentation that unifies existing methods. Then, we propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods on code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. The experimental results show that our approach can consistently boost the performance of the studied code search models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models

+

Haonan Li, Yu Hao, Yizhuo Zhai, Zhiyun Qian. 2023

+

+ + [ArXiV] + + + +
+ + static analysis + +

+

Static analysis is a widely used technique in software engineering for identifying and mitigating bugs. However, a significant hurdle lies in achieving a delicate balance between precision and scalability. Large Language Models (LLMs) offer a promising alternative, as recent advances demonstrate remarkable capabilities in comprehending, generating, and even debugging code. Yet, the logic of bugs can be complex and require sophisticated reasoning and a large analysis scope spanning multiple functions. Therefore, at this point, LLMs are better used in an assistive role to complement static analysis. In this paper, we take a deep dive into the open space of LLM-assisted static analysis, using use-before-initialization (UBI) bugs as a case study. To this end, we develop LLift, a fully automated agent that interfaces with both a static analysis tool and an LLM. By carefully designing the agent and the prompts, we are able to overcome a number of challenges, including bug-specific modeling, the large problem scope, the non-deterministic nature of LLMs, etc. Tested in a real-world scenario analyzing nearly a thousand potential UBI bugs produced by static analysis, LLift demonstrates an extremely potent capability, showcasing a high precision (50%) and recall rate (100%). It even identified 13 previously unknown UBI bugs in the Linux kernel. This research paves the way for new opportunities and methodologies in the use of LLMs for bug discovery in extensive, real-world datasets.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Rethinking Negative Pairs in Code Search

+

Haochen Li, Xin Zhou, Luu Anh Tuan, Chunyan Miao. EMNLP 2023

+

+ + [ArXiV] + + [code] + + + +
+ + search + + Transformer + + retrieval + + optimization + + representation + +

+

Recently, contrastive learning has become a key component in fine-tuning code search models for software development efficiency and effectiveness. It pulls together positive code snippets while pushing negative samples away given search queries. Among contrastive learning, InfoNCE is the most widely used loss function due to its better performance. However, the following problems in negative samples of InfoNCE may deteriorate its representation learning: 1) The existence of false negative samples in large code corpora due to duplications. 2). The failure to explicitly differentiate between the potential relevance of negative samples. As an example, a bubble sorting algorithm example is less ``negative’’ than a file saving function for the quick sorting algorithm query. In this paper, we tackle the above problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss function, we apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation. We furthermore discuss the superiority of proposed loss functions with other design alternatives. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and weights estimation methods under state-of-the-art code search models on a large-scale public dataset consisting of six programming languages.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

StarCoder: may the source be with you!

+

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries. 2023

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation

+

Xin-Ye Li, Jiang-Tian Xue, Zheng Xie, Ming Li. 2023

+

+ + [ArXiV] + + + +
+ + generation + + Transformer + +

+

Code generation aims to automatically generate source code from high-level task specifications, which can significantly increase productivity of software engineering. Recently, approaches based on large language models (LLMs) have shown remarkable code generation abilities on simple tasks. However, generate code for more complex tasks, such as competition-level problems, remains challenging. In this paper, we introduce Brainstorm framework for code generation. It leverages a brainstorming step that generates and selects diverse thoughts on the problem to facilitate algorithmic reasoning, where the thoughts are possible blueprint of solving the problem. We demonstrate that Brainstorm significantly enhances the ability of LLMs to solve competition-level programming problems, resulting in a more than 50% increase in the pass@$k$ metrics for ChatGPT on the CodeContests benchmark, achieving state-of-the-art performance. Furthermore, our experiments conducted on LeetCode contests show that our framework boosts the ability of ChatGPT to a level comparable to that of human programmers.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search

+

Haochen Li, Xin Zhou, Zhiqi Shen. 2024

+

+ + [ArXiV] + + + +
+ + search + + large language models + + metrics + +

+

In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly with the demonstrated code generation capabilities of Large Language Models (LLMs). Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation could potentially be ascribed to the fact that the generated codes, albeit functionally accurate, frequently display a pronounced stylistic deviation from the ground truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Shellcode_IA32: A Dataset for Automatic Shellcode Generation

+

Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella, Bojan Cukic, Samira Shaikh. NLP4Prog 2021

+

+ + [PDF] + + + +
+ + code generation + + dataset + +

+

We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability, starting from natural language comments. We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions. We experiment with standard methods in neural machine translation (NMT) to establish baseline performance levels on this task.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Program Synthesis from Natural Language Using Recurrent Neural Networks

+

Xi Victoria Lin, Chenglong Wang, Deric Pang, Kevin Vu, Michael D. Ernst. Technical Report UW-CSE-17-03-01, University of Washington Department of Computer Science and Engineering 2017

+

+ + [PDF] + + [Tool] + + + +
+ + bimodal + + code generation + +

+

Oftentimes, a programmer may have difficulty implementing a +desired operation. Even when the programmer can describe her +goal in English, it can be difficult to translate into code. Existing +resources, such as question-and-answer websites, tabulate specific +operations that someone has wanted to perform in the past, but +they are not effective in generalizing to new tasks, to compound +tasks that require combining previous questions, or sometimes even +to variations of listed tasks.

+ +

Our goal is to make programming easier and more productive by +letting programmers use their own words and concepts to express +the intended operation, rather than forcing them to accommodate +the machine by memorizing its grammar. We have built a system +that lets a programmer describe a desired operation in natural language, then automatically translates it to a programming language +for review and approval by the programmer. Our system, Tellina, +does the translation using recurrent neural networks (RNNs), a +state-of-the-art natural language processing technique that we augmented with slot (argument) filling and other enhancements.

+ +

We evaluated Tellina in the context of shell scripting. We trained +Tellina’s RNNs on textual descriptions of file system operations +and bash one-liners, scraped from the web. Although recovering +completely correct commands is challenging, Tellina achieves top-3 +accuracy of 80% for producing the correct command structure. In a +controlled study, programmers who had access to Tellina outperformed those who did not, even when Tellina’s predictions were +not completely correct, to a statistically significant degree.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System

+

Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, Michael D. Ernst. LREC 2018

+

+ + [PDF] + + [ArXiV] + + + +
+ + bimodal + + code generation + +

+

We present new data and semantic parsing methods for the problem of mapping english sentences to Bash commands (NL2Bash). Our long-term goal is to enable any user to easily solve otherwise repetitive tasks (such as file manipulation, search, and application-specific scripting) by simply stating their intents in English. We take a first step in this domain, by providing a large new dataset of challenging but commonly used commands paired with their English descriptions, along with the baseline methods to establish performance levels on this task.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On the Impact of Refactoring Operations on Code Naturalness

+

Bin Lin, Csaba Nagy, Gabriele Bavota, Michele Lanza. SANER 2019

+

+ + [IEEEexplore] + + [PDF] + + + +
+ + language model + + refactoring + +

+

Recent studies have demonstrated that software is natural, that is, its source code is highly repetitive and predictable like human languages. Also, previous studies suggested the existence of a relationship between code quality and its naturalness, presenting empirical evidence showing that buggy code is “less natural” than non-buggy code. We conjecture that this qualitynaturalness relationship could be exploited to support refactoring activities (e.g., to locate source code areas in need of refactoring). We perform a first step in this direction by analyzing whether refactoring can improve the naturalness of code. We use state-of-the-art tools to mine a large dataset of refactoring operations performed in open source systems. Then, we investigate the impact of different types of refactoring operations on the naturalness of the impacted code. We found that (i) code refactoring does not necessarily increase the naturalness of the refactored code; and (ii) the impact on the code naturalness strongly depends on the type of refactoring operations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Latent Predictor Networks for Code Generation

+

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, Phil Blunsom. ACL 2016

+

+ + [ArXiV] + + + +
+ + bimodal + + code generation + +

+

Many language generation tasks require +the production of text conditioned on both +structured and unstructured inputs. +We present a novel neural network architecture which generates an output sequence +conditioned on an arbitrary number of input functions. +Crucially, our approach +allows both the choice of conditioning +context and the granularity of generation, +for example characters or tokens, to be +marginalised, thus permitting scalable and +effective training. Using this framework, +we address the problem of generating programming code from a mixed natural language and structured specification. +We create two new data sets for this paradigm +derived from the collectible trading card +games Magic the Gathering and Hearthstone. On these, and a third preexisting +corpus, we demonstrate that marginalising multiple predictors allows our model +to outperform strong benchmarks.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Adaptive Deep Code Search

+

Chunyang Ling, Zeqi Lin, Yanzhen Zou, Bing Xie. ICPC 2020

+

+ + [ACM] + + + +
+ + search + +

+

Searching code in a large-scale codebase using natural language queries is a common practice during software development. Deep learning-based code search methods demonstrate superior performance if models are trained with large amount of text-code pairs. However, few deep code search models can be easily transferred from one codebase to another. It can be very costly to prepare training data for a new codebase and re-train an appropriate deep learning model. In this paper, we propose AdaCS, an adaptive deep code search method that can be trained once and transferred to new codebases. AdaCS decomposes the learning process into embedding domain-specific words and matching general syntactic patterns. Firstly, an unsupervised word embedding technique is used to construct a matching matrix to represent the lexical similarities. Then, a recurrent neural network is used to capture latent syntactic patterns from these matching matrices in a supervised way. As the supervised task learns general syntactic patterns that exist across domains, AdaCS is transferable to new codebases. Experimental results show that: when extended to new software projects never seen in the training data, AdaCS is more robust and significantly outperforms state-of-the-art deep code search methods.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Graph Matching and Searching for Semantic Code Retrieval

+

Xiang Ling, Lingfei Wu, Saizhuo Wang, Gaoning Pan, Tengfei Ma, Fangli Xu, Alex X. Liu, Chunming Wu, Shouling Ji. TKDD 2020

+

+ + [ArXiV] + + + +
+ + search + + GNN + +

+

Code retrieval is to find the code snippet from a large corpus of source code repositories that highly matches the query of natural language description. Recent work mainly uses natural language processing techniques to process both query texts (i.e., human natural language) and code snippets (i.e., machine programming language), however neglecting the deep structured features of query texts and source codes, both of which contain rich semantic information. In this paper, we propose an end-to-end deep graph matching and searching (DGMS) model based on graph neural networks for the task of semantic code retrieval. To this end, we first represent both natural language query texts and programming language code snippets with the unified graph-structured data, and then use the proposed graph matching and searching model to retrieve the best matching code snippet. In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them by cross-attention based semantic matching operations. We evaluate the proposed DGMS model on two public code retrieval datasets with two representative programming languages (i.e., Java and Python). Experiment results demonstrate that DGMS significantly outperforms state-of-the-art baseline models by a large margin on both datasets. Moreover, our extensive ablation studies systematically investigate and illustrate the impact of each part of DGMS.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Towards Better Program Obfuscation: Optimization via Language Models

+

Han Liu. ICSE 2016

+

+ + + +
+ + deobfuscation + +

+

As a common practice in software development, program +obfuscation aims at deterring reverse engineering and malicious attacks on released source or binary code. Owning ample obfuscation techniques, we have relatively little +knowledge on how to most effectively use them. The biggest +challenge lies in identifying the most useful combination of +these techniques. We propose a unified framework to automatically generate and optimize obfuscation based on an +obscurity language model and a Monte Carlo Markov Chain +(MCMC) based search algorithm. We further instantiate it +for JavaScript programs and developed the Closure tool. +Compared to the well-known Google Closure Compiler, Closure outperforms its default setting by 26%. For programs +which have already been well obfuscated, Closure can still +outperform by 22%.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural-Machine-Translation-Based Commit Message Generation: How Far Are We?

+

Zhongxin Liu, Xin Xia, Ahmed E. Hassan, David Lo, Zhenchang Xing, Xinyu Wang. ASE 2018

+

+ + + +
+ + edit + + summarization + +

+

Commit messages can be regarded as the documentation of software changes. These messages describe the content and purposes of changes, hence are useful for program comprehension and software maintenance. However, due to the lack of time and direct motivation, commit messages sometimes are neglected by developers. To address this problem, Jiang et al. proposed an approach (we refer to it as NMT), which leverages a neural machine translation algorithm to automatically generate short commit messages from code. The reported performance of their approach is promising, however, they did not explore why their approach performs well. Thus, in this paper, we first perform an in-depth analysis of their experimental results. We find that (1) Most of the test <pre>diffs</pre> from which NMT can generate high-quality messages are similar to one or more training <pre>diffs</pre> at the token level. (2) About 16% of the commit messages in Jiang et al.’s dataset are noisy due to being automatically generated or due to them describing repetitive trivial changes. (3) The performance of NMT declines by a large amount after removing such noisy commit messages. In addition, NMT is complicated and time-consuming. Inspired by our first finding, we proposed a simpler and faster approach, named NNGen (Nearest Neighbor Generator), to generate concise commit messages using the nearest neighbor algorithm. Our experimental results show that NNGen is over 2,600 times faster than NMT, and outperforms NMT in terms of BLEU (an accuracy measure that is widely used to evaluate machine translation systems) by 21%. Finally, we also discuss some observations for the road ahead for automated commit message generation to inspire other researchers.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing

+

Xiao Liu, Xiaoting Li, Rupesh Prajapati, Dinghao Wu. AAAI 2019

+

+ + + +
+ + fuzzing + + code generation + +

+

Compilers are among the most fundamental programming +tools for building software. However, production compilers +remain buggy. Fuzz testing is often leveraged with newly-generated, +or mutated inputs in order to find new bugs or security vulnerabilities. +In this paper, we propose a grammar-based fuzzing tool called DeepFuzz. Based on a generative +Sequence-to-Sequence model, DeepFuzz automatically and continuously generates well-formed +C programs. We use this set of new C programs to fuzz off-the-shelf C compilers, e.g. GCC and Clang/LLVM. +We present a detailed case study to analyze the success rate and coverage improvement of the +generated C programs for fuzz testing. We analyze the performance of DeepFuzz with three types of sampling +methods as well as three types of generation strategies. Consequently, DeepFuzz +improved the testing efficacy in regards to the line, function, and branch coverage. In our preliminary +study, we found and reported 8 bugs of GCC, all of which are actively being addressed by developers.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Generating commit messages from diffs using pointer-generator network

+

Qin Liu, Zihe Liu, Hongming Zhu, Hongfei Fan, Bowen Du, Yu Qian.. MSR 2019

+

+ + + +
+ + edit + +

+

The commit messages in source code repositories are valuable but not easy to be generated manually in time for tracking issues, reporting bugs, and understanding codes. Recently published works indicated that the deep neural machine translation approaches have drawn considerable attentions on automatic generation of commit messages. However, they could not deal with out-of-vocabulary (OOV) words, which are essential context-specific identifiers such as class names and method names in code diffs. In this paper, we propose PtrGNCMsg, a novel approach which is based on an improved sequence-to-sequence model with the pointer-generator network to translate code diffs into commit messages. By searching the smallest identifier set with the highest probability, PtrGNCMsg outperforms recent approaches based on neural machine translation, and first enables the prediction of OOV words. The experimental results based on the corpus of diffs and manual commit messages from the top 2,000 Java projects in GitHub show that PtrGNCMsg outperforms the state-of-the-art approach with improved BLEU by 1.02, ROUGE-1 by 4.00 and ROUGE-L by 3.78, respectively.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Sport and Refactor Inconsistent Method Names

+

Kui Liu, Dongsun Kim, Tegawendé F. Bissyandé, Taeyoung Kim, Kisub Kim, Anil Koyuncu, Suntae Kim, Yves Le Traon. ICSE 2019

+

+ + + +
+ + naming + +

+

To ensure code readability and facilitate software maintenance, program methods must be named properly. In particular, method names must be consistent with the corresponding method implementations. Debugging method names remains an important topic in the literature, where various approaches analyze commonalities among method names in a large dataset to detect inconsistent method names and suggest better ones. We note that the state-of-the-art does not analyze the implemented code itself to assess consistency. We thus propose a novel automated approach to debugging method names based on the analysis of consistency between method names and method code. The approach leverages deep feature representation techniques adapted to the nature of each artifact. Experimental results on over 2.1 million Java methods show that we can achieve up to 15 percentage points improvement over the state-of-the-art, establishing a record performance of 67.9% F1-measure in identifying inconsistent method names. We further demonstrate that our approach yields up to 25% accuracy in suggesting full names, while the state-of-the-art lags far behind at 1.1% accuracy. Finally, we report on our success in fixing 66 inconsistent method names in a live study on projects in the wild.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural query expansion for code search

+

Jason Liu, Seohyun Kim, Vijayaraghavan Murali, Swarat Chaudhuri, Satish Chandra. MAPL 2019

+

+ + + +
+ + search + +

+

Searching repositories of existing source code for code snippets is a key task in software engineering. Over the years, many approaches to this problem have been proposed. One recent tool called NCS, takes in a natural language query and outputs relevant code snippets, often being able to correctly answer Stack Overflow questions. But what happens when the developer doesn’t provide a query with a clear intent? What if shorter queries are used to demonstrate a more vague intent?

+ +

We find that the performance of NCS regresses with shorter queries. Furthermore, data from developers’ code search history logs shows that shorter queries have a less successful code search session: there are more query reformulations and more time is spent browsing the results. These observations lead us to believe that using NCS alone with short queries may not be productive enough.

+ +

In this paper, we explore an additional way of using neural networks in code search: the automatic expansion of queries. We present NQE, a neural model that takes in a set of keywords and predicts a set of keywords to expand the query to NCS. NQE learns to predict keywords that co-occur with the query keywords in the underlying corpus, which helps expand the query in a productive way. Our results show that with query expansion, NQE + NCS is able to perform better than using NCS alone.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Automating Just-In-Time Comment Updating

+

Zhongxin Liu, Xin Xia, Meng Yan, Shanping Li. ASE 2020

+

+ + + +
+ + documentation + +

+

Code comments are valuable for program comprehension and software maintenance, and also require maintenance with code evolution. However, when changing code, developers sometimes neglect updating the related comments, bringing in inconsistent or obsolete comments (aka., bad comments). Such comments are detrimental since they may mislead developers and lead to future bugs. Therefore, it is necessary to fix and avoid bad comments. In this work, we argue that bad comments can be reduced and even avoided by automatically performing comment updates with code changes. We refer to this task as “Just-In-Time (JIT) Comment Updating” and propose an approach named CUP (Comment UPdater) to automate this task. CUP can be used to assist developers in updating comments during code changes and can consequently help avoid the introduction of bad comments. Specifically, CUP leverages a novel neural sequence-to-sequence model to learn comment update patterns from extant code-comment co-changes and can automatically generate a new comment based on its corresponding old comment and code change. Several customized enhancements, such as a special tokenizer and a novel co-attention mechanism, are introduced in CUP by us to handle the characteristics of this task. We build a dataset with over 108K comment-code co-change samples and evaluate CUP on it. The evaluation results show that CUP outperforms an information-retrieval-based and a rule-based baselines by substantial margins, and can reduce developers’ edits required for JIT comment updating. In addition, the comments generated by our approach are identical to those updated by developers in 1612 (16.7%) test samples, 7 times more than the best-performing baseline.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Open-ended Knowledge Tracing

+

Naiming Liu, Zichao Wang, Richard G. Baraniuk, Andrew Lan. 2022

+

+ + [ArXiV] + + [code] + + + +
+ + education + + code generation + +

+

In education applications, knowledge tracing refers to the problem of estimating students’ time-varying concept/skill mastery level from their past responses to questions and predicting their future performance. One key limitation of most existing knowledge tracing methods is that they treat student responses to questions as binary-valued, i.e., whether they are correct or incorrect. Response correctness analysis/prediction ignores important information on student knowledge contained in the exact content of the responses, especially for open-ended questions. In this paper, we conduct the first exploration into open-ended knowledge tracing (OKT) by studying the new task of predicting students’ exact open-ended responses to questions. Our work is grounded in the domain of computer science education with programming questions. We develop an initial solution to the OKT problem, a student knowledge-guided code generation approach, that combines program synthesis methods using language models with student knowledge tracing methods. We also conduct a series of quantitative and qualitative experiments on a real-world student code dataset to validate OKT and demonstrate its promise in educational applications.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code Execution with Pre-trained Language Models

+

Chenxiao Liu, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan, Nan Duan. 2023

+

+ + [ArXiV] + + + +
+ + Transformer + + execution + +

+

Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code search and text-to-code generation. Our analysis provides insights into the learning and generalization abilities of pre-trained models for code execution.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Fine-Tuning Large Language Models for Answering Programming Questions with Code Snippets

+

V. Lomshakov, S. Kovalchuk, M. Omelchenko, S. Nikolenko, A. Aliev. ICCS 2023

+

+ + [LNCS] + + [Papers with Code ] + + + +
+ + program synthesis + + question answering + + large language models + +

+

We study the ability of pretrained large language models (LLM) to answer questions from online question answering fora such as Stack Overflow. We consider question-answer pairs where the main part of the answer consists of source code. On two benchmark datasets — CoNaLa and a newly collected dataset based on Stack Overflow — we investigate how a closed-book question answering system can be improved by fine-tuning the LLM for the downstream task, prompt engineering, and data preprocessing. We use publicly available autoregressive language models such as GPT-Neo, CodeGen, and PanGu-Coder, and after the proposed fine-tuning achieve a BLEU score of 0.4432 on the CoNaLa test set, significantly exceeding previous state of the art for this task.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Learning to Detect Redundant Method Comments

+

Annie Louis, Santanu Kumar Dash, Earl T. Barr, Charles Sutton. 2018

+

+ + [ArXiV] + + + +
+ + bimodal + + documentation + +

+

Comments in software are critical for maintenance and reuse. But apart from prescriptive advice, there is little practical support or quantitative understanding of what makes a comment useful. In this paper, we introduce the task of identifying comments which are uninformative about the code they are meant to document. To address this problem, we introduce the notion of comment entailment from code, high entailment indicating that a comment’s natural language semantics can be inferred directly from the code. Although not all entailed comments are low quality, comments that are too easily inferred, for example, comments that restate the code, are widely discouraged by authorities on software style. Based on this, we develop a tool called CRAIC which scores method-level comments for redundancy. Highly redundant comments can then be expanded or alternately removed by the developer. CRAIC uses deep language models to exploit large software corpora without requiring expensive manual annotations of entailment. We show that CRAIC can perform the comment entailment task with good agreement with human judgements. Our findings also have implications for documentation tools. For example, we find that common tags in Javadoc are at least two times more predictable from code than non-Javadoc sentences, suggesting that Javadoc tags are less informative than more free-form comments

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Where should I comment my code? A dataset and model for predicting locations that need comments

+

Annie Louis, Santanu Kumar Dash, Earl T. Barr, Charles Sutton. International Conference on Software Engineering (ICSE; NIER track) 2020

+

+ + [ArXiV] + + [Data] + + + +
+ + bimodal + + documentation + +

+

Programmers should write code comments, but not on every line +of code. We have created a machine learning model that suggests +locations where a programmer should write a code comment. We +trained it on existing commented code to learn locations that are +chosen by developers. Once trained, the model can predict locations +in new code. Our models achieved precision of 74% and recall of +13% in identifying comment-worthy locations. This first success +opens the door to future work, both in the new where-to-comment +problem and in guiding comment generation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes

+

Pablo Loyola, Edison Marrese-Taylor, Yutaka Matsuo. 2017

+

+ + [ArXiV] + + + +
+ + edit + + summarization + +

+

We propose a model to automatically describe changes introduced in the source code of a program using natural language. Our method receives as input a set of code commits, which contains both the modifications and message introduced by an user. These two modalities are used to train an encoder-decoder architecture. We evaluated our approach on twelve real world open source projects from four different programming languages. Quantitative and qualitative results showed that the proposed approach can generate feasible and semantically sound descriptions not only in standard in-project settings, but also in a cross-project setting.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Content Aware Source Code Change Description Generation

+

Pablo Loyola, Edison Marrese-Taylor, Jorge Balazs, Yutaka Matsuo, Fumiko Satoh. International Natural Language Generation Conference 2018

+

+ + + +
+ + edit + + summarization + +

+

We propose to study the generation of descriptions from source code changes by integrating the messages included on code +commits and the intra-code documentation +inside the source in the form of docstrings. +Our hypothesis is that although both types +of descriptions are not directly aligned in +semantic terms —one explaining a change +and the other the actual functionality of +the code being modified— there could be +certain common ground that is useful for +the generation. To this end, we propose +an architecture that uses the source code-docstring relationship to guide the description generation. We discuss the results of +the approach comparing against a baseline +based on a sequence-to-sequence model, +using standard automatic natural language +generation metrics as well as with a human +study, thus offering a comprehensive view +of the feasibility of the approach.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Program Classification Using Gated Graph Attention Neural Network for Online Programming Service

+

Mingming Lu, Dingwu Tan, Naixue Xiong, Zailiang Chen, Haifeng Li. 2019

+

+ + [ArXiV] + + + +
+ + GNN + + representation + +

+

The online programing services, such as Github, TopCoder, and EduCoder, have promoted a lot of social interactions among the service users. However, the existing social interactions is rather limited and inefficient due to the rapid increasing of source-code repositories, which is difficult to explore manually. The emergence of source-code mining provides a promising way to analyze those source codes, so that those source codes can be relatively easy to understand and share among those service users. Among all the source-code mining attempts,program classification lays a foundation for various tasks related to source-code understanding, because it is impossible for a machine to understand a computer program if it cannot classify the program correctly. Although numerous machine learning models, such as the Natural Language Processing (NLP) based models and the Abstract Syntax Tree (AST) based models, have been proposed to classify computer programs based on their corresponding source codes, the existing works cannot fully characterize the source codes from the perspective of both the syntax and semantic information. To address this problem, we proposed a Graph Neural Network (GNN) based model, which integrates data flow and function call information to the AST,and applies an improved GNN model to the integrated graph, so as to achieve the state-of-art program classification accuracy. The experiment results have shown that the proposed work can classify programs with accuracy over 97%.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

+

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu. 2021

+

+ + [ArXiV] + + + +
+ + benchmark + + Transformer + +

+

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

ReACC: A Retrieval-Augmented Code Completion Framework

+

Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, Alexey Svyatkovskiy. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + autocomplete + +

+

Code completion, which aims to predict the following code token(s) according to the code context, can improve the productivity of software development. Recent work has proved that statistical language modeling with transformers can greatly improve the performance in the code completion task via learning from large-scale source code datasets. However, current approaches focus only on code context within the file or project, i.e. internal context. Our distinction is utilizing “external” context, inspired by human behaviors of copying from the related code snippets when writing code. Specifically, we propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval. We adopt a stage-wise training approach that combines a source code retriever and an auto-regressive language model for programming language. We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Aroma: code recommendation via structural code search

+

Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, Satish Chandra. PACMPL 2015

+

+ + + +
+ + search + +

+

Programmers often write code that has similarity to existing code written somewhere. A tool that could help programmers to search such similar code would be immensely useful. Such a tool could help programmers to extend partially written code snippets to completely implement necessary functionality, help to discover extensions to the partial code which are commonly included by other programmers, help to cross-check against similar code written by other programmers, or help to add extra code which would fix common mistakes and errors. We propose Aroma, a tool and technique for code recommendation via structural code search. Aroma indexes a huge code corpus including thousands of open-source projects, takes a partial code snippet as input, searches the corpus for method bodies containing the partial code snippet, and clusters and intersects the results of the search to recommend a small set of succinct code snippets which both contain the query snippet and appear as part of several methods in the corpus. We evaluated Aroma on 2000 randomly selected queries created from the corpus, as well as 64 queries derived from code snippets obtained from Stack Overflow, a popular website for discussing code. We implemented Aroma for 4 different languages, and developed an IDE plugin for Aroma. Furthermore, we conducted a study where we asked 12 programmers to complete programming tasks using Aroma, and collected their feedback. Our results indicate that Aroma is capable of retrieving and recommending relevant code snippets efficiently.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Structured Generative Models of Natural Source Code

+

Chris J. Maddison, Daniel Tarlow. ICML 2014

+

+ + + +
+ + language model + + code generation + + grammar + + grammar + +

+

We study the problem of building generative +models of natural source code (NSC); that is, +source code written by humans and meant to +be understood by humans. Our primary con- +tribution is to describe new generative models +that are tailored to NSC. The models are based +on probabilistic context free grammars (PCFGs) +and neuro-probabilistic language models (Mnih +& Teh, 2012), which are extended to incorporate +additional source code-specific structure. These +models can be efficiently trained on a corpus +of source code and outperform a variety of less +structured baselines in terms of predictive log +likelihoods on held-out data.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors

+

Junayed Mahmud, Fahim Faisal, Raihan Islam Arnob, Antonios Anastasopoulos, Kevin Moran. NLP4Prog 2021

+

+ + [PDF] + + + +
+ + survey + + summarization + + Transformer + +

+

Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to “translate” code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an error taxonomy that can be used to drive future research efforts.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

NL2Type: Inferring JavaScript Function Types from Natural Language Information

+

Rabee Sohail Malik, Jibesh Patra, Michael Pradel. ICSE 2019

+

+ + + +
+ + bimodal + + types + +

+

JavaScript is dynamically typed and hence lacks thetype safety of statically typed languages, +leading to suboptimal IDE support, difficult to understand APIs, and unexpected run-time behavior. +Several gradual type systems have been proposed, e.g., Flow and TypeScript, but they rely on developers +to annotatecode with types. This paper presents NL2Type, a learning-based approach for predicting likely +type signatures of JavaScript functions. The key idea is to exploit natural language information in +source code, such as comments, function names, and parameternames, a rich source of knowledge +that is typically ignored by type inference algorithms. We formulate the problem of predicting +types as a classification problem and train a recurrent, LSTM-based neural model that, after learning +from an annotatedcode base, predicts function types for unannotated code. We evaluate the +approach with a corpus of 162,673 JavaScript files from real-world projects. +NL2Type predicts types with aprecision of 84.1% and a recall of 78.9% when considering only +the top-most suggestion, and with a precision of 95.5% and arecall of 89.6% when +considering the top-5 suggestions. The +approach outperforms both JSNice, a state-of-the-art approach that analyzes implementations +of functions instead of natural language information, and DeepTyper, a recent type prediction +approach that is also based on deep learning. Beyond predicting types, NL2Type serves as a +consistency checker for existing type annotations. We show that it discovers 39 inconsistencies +that deserve developer attention (from a manual analysis of 50 warnings), most of which +are due to incorrect type annotations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Static Neural Compiler Optimization via Deep Reinforcement Learning

+

Rahim Mammadli, Ali Jannesari, Felix Wolf. 2020

+

+ + [ArXiV] + + + +
+ + compilation + +

+

The phase-ordering problem of modern compilers has received a lot of attention from the research community over the years, yet remains largely unsolved. Various optimization sequences exposed to the user are manually designed by compiler developers. In designing such a sequence developers have to choose the set of optimization passes, their parameters and ordering within a sequence. Resulting sequences usually fall short of achieving optimal runtime for a given source code and may sometimes even degrade the performance when compared to unoptimized version. In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem. Provided with sub-sequences constituting LLVM’s O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training and achieves competitive performance on the validation set, gaining up to 1.32x speedup on previously-unseen programs. Notably, our approach differs from autotuning methods by not depending on one or more test runs of the program for making successful optimization decisions. It has no dependence on any dynamic feature, but only on the statically-attainable intermediate representation of the source code. We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents, at first to complement, and eventually replace the hand-crafted optimization sequences.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A User-Guided Approach to Program Analysis

+

Ravi Mangal, Xin Zhang, Aditya V. Nori, Mayur Naik. FSE 2015

+

+ + + +
+ + program analysis + +

+

Program analysis tools often produce undesirable output +due to various approximations. We present an approach +and a system Eugene that allows user feedback to guide +such approximations towards producing the desired output. +We formulate the problem of user-guided program analysis in terms of solving a combination of hard rules and soft +rules: hard rules capture soundness while soft rules capture +degrees of approximations and preferences of users. Our +technique solves the rules using an off-the-shelf solver in a +manner that is sound (satisfies all hard rules), optimal (maximally satisfies soft rules), and scales to real-world analy- +ses and programs. We evaluate Eugene on two different +analyses with labeled output on a suite of seven Java pro- +grams of size 131–198 KLOC. We also report upon a user +study involving nine users who employ Eugene to guide an +information-flow analysis on three Java micro-benchmarks. +In our experiments, Eugene significantly reduces misclassified reports upon providing limited amounts of feedback.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Topic modeling of public repositories at scale using names in source code

+

Vadim Markovtsev, Eiso Kant. 2017

+

+ + [ArXiV] + + [website] + + [code] + + + +
+ + topic modeling + + pattern mining + +

+

Programming languages themselves have a limited number of reserved keywords and character based tokens that +define the language specification. However, programmers have a rich use of natural language within their code +through comments, text literals and naming entities. The programmer defined names that can be found in source +code are a rich source of information to build a high level understanding of the project. The goal of this paper +is to apply topic modeling to names used in over 13.6 million repositories and perceive the inferred topics. +One of the problems in such a study is the occurrence of duplicate repositories not officially marked as forks (obscure forks). +We show how to address it using the same identifiers which are extracted for topic modeling.

+ +

We open with a discussion on naming in source code, we then elaborate on our approach to remove exact duplicate +and fuzzy duplicate repositories using Locality Sensitive Hashing on the bag-of-words model and then discuss our work +on topic modeling; and finally present the results from our data analysis together with open-access to the source code, +tools and datasets.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Public Git Archive: a Big Code dataset for all

+

Vadim Markovtsev, Waren Long. MSR 2018

+

+ + [ArXiV] + + [GitHub] + + [data] + + + +
+ + dataset + +

+

The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research potential enabled by the available open source code is clearly substantial, no significant large-scale open source code datasets exist. In this paper, we present the Public Git Archive – dataset of 182,014 top-bookmarked Git repositories from GitHub. We describe the novel data retrieval pipeline to reproduce it. We also elaborate on the strategy for performing dataset updates and legal issues. The Public Git Archive occupies 3.0 TB on disk and is an order of magnitude larger than the current source code datasets. The dataset is made available through HTTP and provides the source code of the projects, the related metadata, and development history. The data retrieval pipeline employs an optimized worker queue model and an optimized archive format to efficiently store forked Git repositories, reducing the amount of data to download and persist. Public Git Archive aims to open a myriad of new opportunities for Big Code research.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

STYLE-ANALYZER: fixing code style inconsistencies with interpretable unsupervised algorithms

+

Vadim Markovtsev, Waren Long, Hugo Mougard, Konstantin Slavnov, Egor Bulychev. MSR 2019

+

+ + [ArXiV] + + + +
+ + style + +

+

Source code reviews are manual, time-consuming, and expensive. Human involvement should be focused on analyzing the most relevant aspects of the program, such as logic and maintainability, rather than amending style, syntax, or formatting defects. Some tools with linting capabilities can format code automatically and report various stylistic violations for supported programming languages. They are based on rules written by domain experts, hence, their configuration is often tedious, and it is impractical for the given set of rules to cover all possible corner cases. Some machine learning-based solutions exist, but they remain uninterpretable black boxes. This paper introduces STYLE-ANALYZER, a new open source tool to automatically fix code formatting violations using the decision tree forest model which adapts to each codebase and is fully unsupervised. STYLE-ANALYZER is built on top of our novel assisted code review framework, Lookout. It accurately mines the formatting style of each analyzed Git repository and expresses the found format patterns with compact human-readable rules. STYLE-ANALYZER can then suggest style inconsistency fixes in the form of code review comments. We evaluate the output quality and practical relevance of STYLE-ANALYZER by demonstrating that it can reproduce the original style with high precision, measured on 19 popular JavaScript projects, and by showing that it yields promising results in fixing real style mistakes. STYLE-ANALYZER includes a web application to visualize how the rules are triggered. We release STYLE-ANALYZER as a reusable and extendable open source software package on GitHub for the benefit of the community.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Using Deep Learning to Generate Complete Log Statements

+

Antonio Mastropaolo, Luca Pascarella, Gabriele Bavota. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + logging + +

+

Logging is a practice widely adopted in several phases of the software lifecycle. For example, during software development log statements allow engineers to verify and debug the system by exposing fine-grained information of the running software. While the benefits of logging are undisputed, taking proper decisions about where to inject log statements, what information to log, and at which log level (e.g., error, warning) is crucial for the logging effectiveness. In this paper, we present LANCE (Log stAtemeNt reCommEnder), the first approach supporting developers in all these decisions. LANCE features a Text-To-Text-Transfer-Transformer (T5) model that has been trained on 6,894,456 Java methods. LANCE takes as input a Java method and injects in it a full log statement, including a human-comprehensible logging message and properly choosing the needed log level and the statement location. Our results show that LANCE is able to (i) properly identify the location in the code where to inject the statement in 65.9% of Java methods requiring it; (ii) selecting the proper log level in 66.2% of cases; and (iii) generate a completely correct log statement including a meaningful logging message in 15.2% of cases.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks

+

Nikita Mehrotra, Navdha Agarwal, Piyush Gupta, Saket Anand, David Lo, Rahul Purandare. 2020

+

+ + [ArXiV] + + + +
+ + clone + + GNN + +

+

Code clones are duplicate code fragments that share (nearly) similar syntax or semantics. Code clone detection plays an important role in software maintenance, code refactoring, and reuse. A substantial amount of research has been conducted in the past to detect clones. A majority of these approaches use lexical and syntactic information to detect clones. However, only a few of them target semantic clones. Recently, motivated by the success of deep learning models in other fields, including natural language processing and computer vision, researchers have attempted to adopt deep learning techniques to detect code clones. These approaches use lexical information (tokens) and(or) syntactic structures like abstract syntax trees (ASTs) to detect code clones. However, they do not make sufficient use of the available structural and semantic information hence, limiting their capabilities.

+ +

This paper addresses the problem of semantic code clone detection using program dependency graphs and geometric neural networks, leveraging the structured syntactic and semantic information. We have developed a prototype tool HOLMES, based on our novel approach, and empirically evaluated it on popular code clone benchmarks. Our results show that HOLMES performs considerably better than the other state-of-the-art tool, TBCCD. We also evaluated HOLMES on unseen projects and performed cross dataset experiments to assess the generalizability of HOLMES. Our results affirm that HOLMES outperforms TBCCD since most of the pairs that HOLMES detected were either undetected or suboptimally reported by TBCCD.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Machine Learning Framework for Programming by Example

+

Aditya Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, Adam Kalai. ICML 2013

+

+ + + +
+ + code generation + +

+

Learning programs is a timely and interesting challenge. In Programming by Example +(PBE), a system attempts to infer a program +from input and output examples alone, by +searching for a composition of some set of +base functions. We show how machine learning can be used to speed up this seemingly +hopeless search problem, by learning weights +that relate textual features describing the +provided input-output examples to plausible +sub-components of a program. This generic +learning framework lets us address problems +beyond the scope of earlier PBE systems. +Experiments on a prototype implementation +show that learning improves search and ranking on a variety of text processing tasks found +on help forums.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepDelta: Learning to Repair Compilation Errors

+

Ali Mesbah, Andrew Rice, Emily Johnston, Nick Glorioso, Edward Aftandilian.. 2019

+

+ + + +
+ + repair + + edit + + compilation + +

+

Programmers spend a substantial amount of time manually repairing +code that does not compile. We observe that the repairs for +any particular error class typically follow a pattern and are highly +mechanical. We propose a novel approach that automatically learns +these patterns with a deep neural network and suggests program +repairs for the most costly classes of build-time compilation failures. +We describe how we collect all build errors and the human-authored, +in-progress code changes that cause those failing builds to transition +to successful builds at Google. We generate an AST diff from the +textual code changes and transform it into a domain-specific +language called Delta that encodes the change that must be made +to make the code compile. We then feed the compiler diagnostic +information (as source) and the Delta changes that resolved the +diagnostic (as target) into a Neural Machine Translation network for +training. For the two most prevalent and costly classes of Java compilation errors, +namely missing symbols and mismatched methodsignatures, our system called DeepDelta, +generates the correct repair changes for 19,314 out of 38,788 (50%) of unseen compilation +errors. The correct changes are in the top three suggested axes 86% of the time on average.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference

+

Amir M. Mir, Evaldas Latoskinas, Georgios Gousios. MSR 2021

+

+ + [ArXiV] + + [Dataset] + + + +
+ + dataset + + types + +

+

In this paper, we present ManyTypes4Py, a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5,382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of the duplication bias. To facilitate training and evaluation of ML models, the dataset was split into training, validation and test sets by files. To extract type information from abstract syntax trees (ASTs), a lightweight static analyzer pipeline is developed and accompanied with the dataset. Using this pipeline, the collected Python projects were analyzed and the results of the AST analysis were stored in JSON-formatted files. The ManyTypes4Py dataset is shared on zenodo and its tools are publicly available on GitHub.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Type4Py: Deep Similarity Learning-Based Type Inference for Python

+

Amir M. Mir, Evaldas Latoskinas, Sebastian Proksch, Georgios Gousios. 2021

+

+ + [ArXiV] + + [GitHub] + + + +
+ + types + +

+

Dynamic languages, such as Python and Javascript, trade static typing for developer flexibility. While this allegedly enables greater productivity, lack of static typing can cause runtime exceptions, type inconsistencies, and is a major factor for weak IDE support. To alleviate these issues, PEP 484 introduced optional type annotations for Python. As retrofitting types to existing codebases is error-prone and laborious, learning-based approaches have been proposed to enable automatic type annotations based on existing, partially annotated codebases. However, the prediction of rare and user-defined types is still challenging. In this paper, we present Type4Py, a deep similarity learning-based type inference model for Python. We design a hierarchical neural network model that learns to discriminate between types of the same kind and dissimilar types in a high-dimensional space, which results in clusters of types. Nearest neighbor search suggests likely type signatures of given Python functions. The types visible to analyzed modules are surfaced using lightweight dependency analysis. The results of quantitative and qualitative evaluation indicate that Type4Py significantly outperforms state-of-the-art approaches at the type prediction task. Considering the Top-1 prediction, Type4Py obtains 19.33% and 13.49% higher precision than Typilus and TypeWriter, respectively, while utilizing a much bigger vocabulary.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models

+

Mohammad Mahdi Mohajer, Reem Aleithan, Nima Shiri Harzevili, Moshi Wei, Alvine Boaye Belle, Hung Viet Pham, Song Wang. 2023

+

+ + [ArXiV] + + + +
+ + repair + +

+

We introduce SkipAnalyzer, a large language model (LLM)-powered tool for static code analysis. SkipAnalyzer has three components: 1) an LLM-based static bug detector that scans source code and reports specific types of bugs, 2) an LLM-based false-positive filter that can identify false-positive bugs in the results of static bug detectors (e.g., the result of step 1) to improve detection accuracy, and 3) an LLM-based patch generator that can generate patches for the detected bugs above. As a proof-of-concept, SkipAnalyzer is built on ChatGPT, which has exhibited outstanding performance in various software engineering tasks. To evaluate SkipAnalyzer, we focus on two types of typical and critical bugs that are targeted by static bug detection, i.e., Null Dereference and Resource Leak as subjects. We employ Infer to aid the gathering of these two bug types from 10 open-source projects. Consequently, our experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that SkipAnalyzer achieves remarkable performance in the mentioned static analysis tasks, including bug detection, false-positive warning removal, and bug repair. In static bug detection, SkipAnalyzer achieves accuracy values of up to 68.37% for detecting Null Dereference bugs and 76.95% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, SkipAnalyzer can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs. Additionally, SkipAnalyzer surpasses state-of-the-art false-positive warning removal tools. Furthermore, in bug repair, SkipAnalyzer can generate syntactically correct patches to fix its detected bugs with a success rate of up to 97.30%.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size

+

Martin Monperrus, Matias Martinez, He Ye, Fernanda Madeiral, Thomas Durieux, Zhongxing Yu. 2021

+

+ + [ArXiV] + + [Dataset] + + + +
+ + dataset + + edit + +

+

This paper presents Megadiff, a dataset of source code diffs. It focuses on Java, with strict inclusion criteria based on commit message and diff size. Megadiff contains 663 029 Java diffs that can be used for research on commit comprehension, fault localization, automated program repair, and machine learning on code changes.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Building Program Vector Representations for Deep Learning

+

Hao Peng, Lili Mou, Ge Li, Yuxuan Liu, Lu Zhang, Zhi Jin.. International Conference on Knowledge Science, Engineering and Management 2014

+

+ + + +
+ + representation + + grammar + +

+

Deep learning has made significant breakthroughs +in various fields of artificial intelligence. Advantages of deep +learning include the ability to capture highly complicated features, weak involvement of human engineering, etc. However, +it is still virtually impossible to use deep learning to analyze +programs since deep architectures cannot be trained effectively +with pure back propagation. In this pioneering paper, we propose +the “coding criterion” to build program vector representations, +which are the premise of deep learning for program analysis. Our +representation learning approach directly makes deep learning a +reality in this new field. We evaluate the learned vector representations both qualitatively and quantitatively. We conclude, based +on the experiments, the coding criterion is successful in building +program representations. To evaluate whether deep learning +is beneficial for program analysis, we feed the representations +to deep neural networks, and achieve higher accuracy in the +program classification task than “shallow” methods, such as +logistic regression and the support vector machine. This result +confirms the feasibility of deep learning to analyze programs. It +also gives primary evidence of its success in this new field. We +believe deep learning will become an outstanding technique for +program analysis in the near future.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Convolutional Neural Networks over Tree Structures for Programming Language Processing

+

Lili Mou, Ge Li, Lu Zhang, Tao Wang, Zhi Jin. AAAI 2016

+

+ + + +
+ + representation + + grammar + +

+

Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the +artificial intelligence community. However, different from a +natural language sentence, a program contains rich, explicit, +and complicated structural information. Hence, traditional +NLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in +which a convolution kernel is designed over programs’ abstract syntax trees to capture structural information. TBCNN +is a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according +to functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Natural Language Models for Predicting Programming Comments

+

Dana Movshovitz-Attias, William W. Cohen. ACL 2013

+

+ + + +
+ + bimodal + + documentation + + summarization + +

+

Statistical language models have successfully been used to describe and analyze +natural language documents. Recent work +applying language models to programming languages is focused on the task +of predicting code, while mainly ignoring +the prediction of programmer comments. +In this work, we predict comments from +JAVA source files of open source projects, +using topic models and n-grams, and we +analyze the performance of the models +given varying amounts of background data +on the project being predicted. We evaluate models on their comment-completion +capability in a setting similar to code completion tools built into standard code +editors, and show that using a comment +completion tool can save up to 47% of the +comment typing.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts

+

Dana Movshovitz-Attias, William W. Cohen. ACL 2015

+

+ + + +
+ + pattern mining + +

+

Many existing knowledge bases (KBs), including Freebase, Yago, and NELL, rely +on a fixed ontology, given as an input +to the system, which defines the data to +be cataloged in the KB, i.e., a hierarchy of categories and relations between +them. The system then extracts facts that +match the predefined ontology. We propose an unsupervised model that jointly +learns a latent ontological structure of an +input corpus, and identifies facts from the +corpus that match the learned structure. +Our approach combines mixed membership stochastic block models and topic +models to infer a structure by jointly modeling text, a latent concept hierarchy, and +latent semantic relationships among the +entities mentioned in the text. As a case +study, we apply the model to a corpus +of Web documents from the software domain, and evaluate the accuracy of the various components of the learned ontology.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

OctoPack: Instruction Tuning Code Large Language Models

+

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre. 2023

+

+ + [ArXiV] + + + +
+ + dataset + + instruction tuning + +

+

Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack’s benefits in generalizing to a wider set of languages and natural coding tasks. Code, models and data are freely available at https://github.com/bigcode-project/octopack.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Searching a Database of Source Codes Using Contextualized Code Search

+

Rohan Mukherjee, Swarat Chaudhuri, Chris Jermaine. 2020

+

+ + [ArXiV] + + + +
+ + search + + representation + +

+

We assume a database containing a large set of program source codes and consider the problem of contextualized code search over that database. A programmer has written some part of a program, but has left part of the program (such as a method or a function body) incomplete. The goal is to use the context surrounding the missing code to automatically ‘figure out’ which of the codes in the database would be useful to the programmer in order to help complete the missing code, in the sense that the programmer could either re-purpose the retrieved code and use the re-purposed code to fill the missing spot in the program. Or, the user could use the retrieved code as a model for implementing the missing code. The search is ‘contextualized’ in the sense that the search engine should use clues in the partially-completed code to figure out which database code is most useful. The user should not be required to formulate an explicit query.

+ +

We cast contextualized code search as a learning problem, where the goal is to learn a distribution function computing the likelihood that each database code completes the program, and propose a neural model for predicting which database code is likely to be most useful. Because it will be prohibitively expensive to apply a neural model to each code in a database of millions or billions of codes at search time, one of our key technical concerns is ensuring a speedy search. We address this by learning a ‘reverse encoder’ that can be used to reduce the problem of evaluating each database code to computing a convolution of two normal distributions, making it possible to search a large database of codes in a reasonable time.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Program Generation Modulo Static Analysis

+

Rohan Mukherjee, Yeming Wen, Dipak Chaudhari, Thomas W. Reps, Swarat Chaudhuri, Chris Jermaine. NeurIPS 2021

+

+ + [Preprint] + + + +
+ + synthesis + + language model + +

+

State-of-the-art neural models of source code tend to be evaluated on the generation +of individual expressions and lines of code, and commonly fail on long-horizon +tasks such as the generation of entire method bodies. We propose to address this +deficiency using weak supervision from a static program analyzer. Our neurosymbolic method allows a deep generative model to symbolically compute, using calls +to a static-analysis tool, long-distance semantic relationships in the code that it +has already generated. During training, the model observes these relationships +and learns to generate programs conditioned on them. We apply our approach to +the problem of generating entire Java methods given the remainder of the class +that contains the method. Our experiments show that the approach substantially +outperforms state-of-the-art transformers and a model that explicitly tries to learn +program semantics on this task, both in terms of producing programs free of basic +semantic errors and in terms of syntactically matching the ground truth.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Bayesian Sketch Learning for Program Synthesis

+

Vijayaraghavan Murali, Letao Qi, Swarat Chaudhuri, Chris Jermaine. ICLR 2018

+

+ + [ArXiV] + + + +
+ + code generation + + API + +

+

We present a Bayesian statistical approach to the problem of automatic program synthesis. Our synthesizer starts +by learning, offline and from an existing corpus, a probabilistic model of real-world programs. During synthesis, +it is provided some ambiguous and incomplete evidence about the nature of the programming task that the user +wants automated, for example sets of API calls or data types that are relevant for the task. Given this input, the +synthesizer infers a posterior distribution over type-safe programs that assigns higher likelihood to programs +that, according to the learned model, are more likely to match the evidence.

+ +

We realize this approach using two key ideas. First, our learning techniques operate not over code but +syntactic abstractions, or sketches, of programs. During synthesis, we infer a posterior distribution over sketches, +then concretize samples from this distribution into type-safe programs using combinatorial techniques. Second, +our statistical model explicitly models the full intent behind a synthesis task as a latent variable. To infer +sketches, we first estimate a posterior distribution on the intent, then use samples from this posterior to generate +a distribution over possible sketches. We show that our model can be implemented effectively using the new +neural architecture of Bayesian encoder-decoders, which can be trained with stochastic gradient descent and +yields a simple inference procedure.

+ +

We implement our ideas in a system, called BAYOU , for the synthesis of API-heavy Java methods. We train +BAYOU on a large corpus of Android apps, and find that the trained system can often synthesize complex +methods given just a few API method names or data types as evidence. The experiments also justify the design +choice of using a latent intent variable and the levels of abstraction at which sketches and evidence are defined.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Finding Likely Errors with Bayesian Specifications

+

Vijayaraghavan Murali, Swarat Chaudhuri, Chris Jermaine. 2017

+

+ + [ArXiV] + + + +
+ + program analysis + + API + +

+

We present a Bayesian framework for learning probabilistic specifications from large, unstructured code corpora, and +a method to use this framework to statically detect anomalous, hence likely buggy, program behavior. The distinctive +insight here is to build a statistical model that correlates all +specifications hidden inside a corpus with the syntax and +observed behavior of programs that implement these specifications. During the analysis of a particular program, this +model is conditioned into a posterior distribution that prioritizes specifications that are relevant to this program. This +allows accurate program analysis even if the corpus is highly +heterogeneous. The problem of finding anomalies is now +framed quantitatively, as a problem of computing a distance +between a “reference distribution” over program behaviors +that our model expects from the program, and the distribution over behaviors that the program actually produces.

+ +

We present a concrete embodiment of our framework that +combines a topic model and a neural network model to learn +specifications, and queries the learned models to compute +anomaly scores. We evaluate this implementation on the +task of detecting anomalous usage of Android APIs. Our +encouraging experimental results show that the method can +automatically discover subtle errors in Android applications +in the wild, and has high precision and recall compared to +competing probabilistic approaches.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeDSI: Differentiable Code Search

+

Usama Nadeem, Noah Ziems, Shaoen Wu. 2022

+

+ + [ArXiV] + + + +
+ + search + +

+

Reimplementing solutions to previously solved software engineering problems is not only inefficient but also introduces inadequate and error-prone code. Many existing methods achieve impressive performance on this issue by using autoregressive text-generation models trained on code. However, these methods are not without their flaws. The generated code from these models can be buggy, lack documentation, and introduce vulnerabilities that may go unnoticed by developers. An alternative to code generation – neural code search – is a field of machine learning where a model takes natural language queries as input and, in turn, relevant code samples from a database are returned. Due to the nature of this pre-existing database, code samples can be documented, tested, licensed, and checked for vulnerabilities before being used by developers in production. In this work, we present CodeDSI, an end-to-end unified approach to code search. CodeDSI is trained to directly map natural language queries to their respective code samples, which can be retrieved later. In an effort to improve the performance of code search, we have investigated docid representation strategies, impact of tokenization on docid structure, and dataset sizes on overall code search performance. Our results demonstrate CodeDSI strong performance, exceeding conventional robust baselines by 2-6% across varying dataset sizes.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis

+

Shounak Naik, Rajaswa Patil, Swati Agarwal, Veeky Baths. International Conference on Advanced Data Mining and Applications (ADMA 2022) 2022

+

+ + [ArXiV] + + [PDF] + + [Code] + + + +
+ + interpretability + + language model + + evaluation + + Transformer + +

+

Representational Similarity Analysis is a method from cognitive neuroscience, which helps in comparing representations from two different sources of data. In this paper, we propose using Representational Similarity Analysis to probe the semantic grounding in language models of code. We probe representations from the CodeBERT model for semantic grounding by using the data from the IBM CodeNet dataset. Through our experiments, we show that current pre-training methods do not induce semantic grounding in language models of code, and instead focus on optimizing form-based patterns. We also show that even a little amount of fine-tuning on semantically relevant tasks increases the semantic grounding in CodeBERT significantly. Our ablations with the input modality to the CodeBERT model show that using bimodal inputs (code and natural language) over unimodal inputs (only code) gives better semantic grounding and sample efficiency during semantic fine-tuning. Finally, our experiments with semantic perturbations in code reveal that CodeBERT is able to robustly distinguish between semantically correct and incorrect code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

funcGNN: A Graph Neural Network Approach to Program Similarity

+

Aravind Nair, Avijit Roy, Karl Meinke. ESEM 2020

+

+ + [ArXiV] + + + +
+ + GNN + + clone + +

+

Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. We introduce funcGNN, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between high-level language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (0.00194), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with the state of the art methods. funcGNN posses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Lexical Statistical Machine Translation for Language Migration

+

Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen. FSE 2013

+

+ + + +
+ + migration + + API + +

+

Prior research has shown that source code also exhibits naturalness, i.e. it is written by humans and is likely to be +repetitive. The researchers also showed that the n-gram language model is useful in predicting the next token in a source +file given a large corpus of existing source code. In this paper, we investigate how well statistical machine translation +(SMT) models for natural languages could help in migrating source code from one programming language to another. +We treat source code as a sequence of lexical tokens and +apply a phrase-based SMT model on the lexemes of those +tokens. Our empirical evaluation on migrating two Java +projects into C# showed that lexical, phrase-based SMT +could achieve high lexical translation accuracy ( BLEU from +81.3-82.6%). Users would have to manually edit only 11.9-15.8% of the total number of tokens in the resulting code to +correct it. However, a high percentage of total translation +methods (49.5-58.6%) is syntactically incorrect. Therefore, +our result calls for a more program-oriented SMT model that +is capable of better integrating the syntactic and semantic +information of a program to support language migration.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Statistical Semantic Language Model for Source Code

+

Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen, Tien N. Nguyen. FSE 2013

+

+ + + +
+ + language model + +

+

Recent research has successfully applied the statistical n-gram language model to show that source code exhibits a +good level of repetition. The n-gram model is shown to have +good predictability in supporting code suggestion and completion. However, the state-of-the-art n-gram approach to +capture source code regularities/patterns is based only on +the lexical information in a local context of the code units. +To improve predictability, we introduce SLAMC, a novel statistical semantic language model for source code. It incorporates semantic information into code tokens and models the +regularities/patterns of such semantic annotations, called sememes, rather than their lexemes. It combines the local context in semantic n-grams with the global technical concerns/functionality into an n-gram topic model, together with pairwise associations of program elements. Based on SLAMC, +we developed a new code suggestion method, which is empirically evaluated on several projects to have relatively 18–68% +higher accuracy than the state-of-the-art approach.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Study of Repetitiveness of Code Changes in Software Evolution

+

Hoan Anh Nguyen, Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen, and Hridesh Rajan. ASE 2013

+

+ + + +
+ + edit + +

+

In this paper, we present a large-scale study of +repetitiveness of code changes in software evolution. We collected +a large data set of 2,841 Java projects, with 1.7 billion source lines +of code (SLOC) at the latest revisions, 1.8 million code change +revisions (0.4 million fixes), 6.2 million changed files, and 2.5 +billion changed SLOCs. A change is considered repeated within +or cross-project if it matches another change having occurred +in the history of the project or another project, respectively. We +report the following important findings. First, repetitiveness of +changes could be as high as 70–100% at small sizes and decreases +exponentially as size increases. Second, repetitiveness is higher +and more stable in the cross-project setting than in the project-within one. Third, fixing changes repeat similarly to general +changes. Importantly, learning code changes and recommending +them in software evolution is beneficial with accuracy for top-1 +recommendation of over 30% and top-3 of nearly 35%. Repeated +fixing changes could also be useful for automatic program repair.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Statistical Learning Approach for Mining API Usage Mappings for Code Migration

+

Anh Tuan Nguyen, Hoan Anh Nguyen, Tung Thanh Nguyen, Tien N. Nguyen. ASE 2014

+

+ + + +
+ + migration + + API + +

+

The same software product nowadays could appear in multiple platforms and devices. To address business needs, software companies +develop a software product in a programming language and then +migrate it to another one. To support that process, semi-automatic +migration tools have been proposed. However, they require users +to manually define the mappings between the respective APIs of +the libraries used in two languages. To reduce such manual effort, +we introduce StaMiner, a novel data-driven approach that statistically learns the mappings between APIs from the corpus of the +corresponding client code of the APIs in two languages Java and +C#. Instead of using heuristics on the textual or structural similarity +between APIs in two languages to map API methods and classes +as in existing mining approaches, StaMiner is based on a statistical +model that learns the mappings in such a corpus and provides mappings for APIs with all possible arities. Our empirical evaluation +on several projects shows that StaMiner can detect API usage mappings with higher accuracy than a state-of-the-art approach. With +the resulting API mappings mined by StaMiner, Java2CSharp, an +existing migration tool, could achieve a higher level of accuracy.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code

+

Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N. Nguyen. ASE 2014

+

+ + + +
+ + migration + +

+

Prior research shows that directly applying phrase-based SMT on lexical tokens to migrate Java to C# produces +much semantically incorrect code. A key limitation is the use of +sequences in phrase-based SMT to model and translate source +code with well-formed structures. We propose mppSMT, a divideand-conquer technique to address that with novel training and migration algorithms using phrase-based SMT in three phases. First, +mppSMT treats a program as a sequence of syntactic units and +maps/translates such sequences in two languages to one another. +Second, with a syntax-directed fashion, it deals with the tokens +within syntactic units by encoding them with semantic symbols to +represent their data and token types. This encoding via semantic +symbols helps better migration of API usages. Third, the lexical +tokens corresponding to each sememe are mapped or migrated. +The resulting sequences of tokens are merged together to form +the final migrated code. Such divide-and-conquer and syntax-direction strategies enable phrase-based SMT to adapt well to +syntactical structures in source code, thus, improving migration +accuracy. Our empirical evaluation on several real-world systems +shows that 84.8–97.9% and 70–83% of the migrated methods are +syntactically and semantically correct, respectively. 26.3–51.2% +of total migrated methods are exactly matched to the human-written C# code in the oracle. Compared to Java2CSharp, a rule-based migration tool, it achieves higher semantic accuracy from +6.6–57.7% relatively. Importantly, it does not require manual +labeling for training data or manual definition of rules.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Graph-based Statistical Language Model for Code

+

Anh Tuan Nguyen, Tien N. Nguyen. ICSE 2015

+

+ + + +
+ + representation + + language model + + autocomplete + +

+

n-gram statistical language model has been successfully applied to capture programming patterns to support code +completion and suggestion. However, the approaches using n-gram face challenges in capturing the patterns at higher levels +of abstraction due to the mismatch between the sequence nature +in n-grams and the structure nature of syntax and semantics +in source code. This paper presents GraLan, a graph-based +statistical language model and its application in code suggestion. GraLan can learn from a source code corpus and compute +the appearance probabilities of any graphs given the observed +(sub)graphs. We use GraLan to develop an API suggestion +engine and an AST-based language model, ASTLan. ASTLan +supports the suggestion of the next valid syntactic template +and the detection of common syntactic templates. Our empirical +evaluation on a large corpus of open-source projects has shown +that our engine is more accurate in API code suggestion than +the state-of-the-art approaches, and in 75% of the cases, it can +correctly suggest the API with only five candidates. ASTLan also +has high accuracy in suggesting the next syntactic template and +is able to detect many useful and common syntactic templates.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning API Usages from Bytecode: A Statistical Approach

+

Tam The Nguyen, Hung Viet Pham, Phong Minh Vu, Tung Thanh Nguyen. ICSE 2016

+

+ + + +
+ + representation + + API + +

+

Mobile app developers rely heavily on standard API frameworks and libraries. However, learning API usages is often challenging due to the fast-changing nature of API frameworks for mobile systems and the insufficiency of API documentation and source code examples. In this paper, we propose a novel approach to learn API usages from bytecode of Android mobile apps. Our core contributions include HAPI, a statistical model of API usages and three algorithms to extract method call sequences from apps’ bytecode, to train HAPI based on those sequences, and to recommend method calls in code completion using the trained HAPIs. Our empirical evaluation shows that our prototype tool can effectively learn API usages from 200 thousand apps containing 350 million method sequences. It recommends next method calls with top-3 accuracy of 90% and outperforms baseline approaches on average 10-20%.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Mapping API Elements for Code Migration with Vector Representations

+

Trong Duc Nguyen, Anh Tuan Nguyen, Tien N. Nguyen. ICSE 2016

+

+ + + +
+ + migration + + API + +

+

Mapping API elements has a significant role in software development, especially in code migration. A manual process of defining the migration is tedious and error-prone while recent approaches to automatically mine API mappings are limited to discover the mappings with textually similar APIs’ names. This leads to the low accuracy in existing migration tools.We propose an approach to automatically mine API mappings which overcomes the lexical mismatch problem. We represent an API by its usages instead of its name.To characterize an API with its context consisting of surrounding APIs in its usages, we take advantage of Word2Vec model to project the APIs of Java JDK and C# .NET into corresponding continuous vector spaces. The semantic relations among APIs will be observed in those continuous space as the geometric arrangements between their representation vectors in two vector spaces.We use a learning approach to derive the linear (e.g., rotating and scaling) transformation function between two vector spaces. Transformation function is trained from human-defined pairs of API mappings from Java to C#. To find the C# API mapping with a given Java API, we use the learned function to compute its transformed vector in the C# vector space. Then, the C# API which has the most similar vector with the transformed vector is considered as the result. Our experiment shows that for just one suggestion, we are able to correctly derive the API in C# in almost 43% of the cases. With 5 suggestions, we can correctly suggest the correct C# API in almost 3 out of 4 cases (73.2%).

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Exploring API Embedding for API Usages and Applications

+

Trong Duc Nguyen, Anh Tuan Nguyen, Hung Dang Phan, Tien N. Nguyen. ICSE 2017

+

+ + + +
+ + API + + representation + +

+

Word2Vec is a class of neural network models that +as being trained from a large corpus of texts, they can produce for +each unique word a corresponding vector in a continuous space in +which linguistic contexts of words can be observed. In this work, +we study the characteristics of Word2Vec vectors, called API 2 VEC +or API embeddings, for the API elements within the API sequences in source code. Our empirical study shows that the close +proximity of the API 2 VEC vectors for API elements reflects the +similar usage contexts containing the surrounding APIs of those +API elements. Moreover, API 2 VEC can capture several similar +semantic relations between API elements in API usages via vector +offsets. We demonstrate the usefulness of API 2 VEC vectors for +API elements in three applications. First, we build a tool that mines the pairs of API elements that share the same usage relations +among them. The other applications are in the code migration +domain. We develop API 2 API , a tool to automatically learn the +API mappings between Java and C# using a characteristic of the +API 2 VEC vectors for API elements in the two languages: semantic +relations among API elements in their usages are observed in the +two vector spaces for the two languages as similar geometric +arrangements among their API 2 VEC vectors. Our empirical +evaluation shows that API 2 API relatively improves 22.6% and +40.1% top-1 and top-5 accuracy over a state-of-the-art mining +approach for API mappings. Finally, as another application in +code migration, we are able to migrate equivalent API usages +from Java to C# with up to 90.6% recall and 87.2% precision.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Graph-based Mining of In-the-Wild, Fine-grained, Semantic Code Change Patterns

+

Hoan Anh Nguyen, Tien N. Nguyen, Danny Dig, Son Nguyen, Hieu Tran, and Michael Hilton. ICSE 2019

+

+ + + +
+ + edit + + pattern mining + +

+

Existing approaches for detecting repetitive code changes relying on syntactic similarity cannot effectively detect semantic change patterns. In this work, we introduce a novel graph-based mining approach, CPatMiner, which is capable of detecting semantic code change patterns from a large number of open-source repositories by capturing dependencies between fine-grained change elements. We evaluated CPatMiner by mining change patterns in a diverse corpus of 5,000+ open-source projects from GitHub with 170,000+ developers. We use three complementary methods. First, we sent the mined patterns to the authors and received 108 responses. 70% of respondents recognized those patterns as their meaningful frequent changes. 79% of respondents even named the patterns, and 44% wanted IDEs to automate such repetitive changes. The mined patterns belong to various activities: adaptive (9%), perfective (20%), corrective (35%) and preventive (36%). Second, we compared CPatMiner with the state-of-the-art, AST-based technique, and reported that CPatMiner detects 2.1x more meaningful patterns. Third, we used CPatMiner to search for patterns in a corpus of 88 GitHub projects with longer histories consisting of 164M SLOCs. It constructed 322K fine-grained change graphs containing 3M nodes, and detected 17K change patterns which provide unique insights on the practice of change patterns among individuals and teams. We found that a large percentage (75%) of the patterns from individual developers are commonly shared with others, and this holds true for teams. Moreover, we found that the patterns spread widely over time. Thus, we call for a community-based change pattern database to provide important resources in novel applications.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Suggesting Natural Method Names to Check Name Consistencies

+

Son Nguyen, Hung Phan, Trinh Le, Tien N. Nguyen. ICSE 2020

+

+ + [Preprint] + + + +
+ + naming + +

+

Misleading names of the methods in a project or the APIs in a software library confuse developers about program functionality +and API usages, leading to API misuses and defects. In this paper,we introduce MNire, a machine learning approach to check the +consistency between the name of a given method and its implementation. MNire first generates a candidate name and compares the +current name against it. If the two names are sufficiently similar, we consider the method as consistent. To generate the method name, +we draw our ideas and intuition from an empirical study on the nature of method names in a large dataset. Our key finding is that +high proportions of the tokens of method names can be found in the three contexts of a given method including its body, +the interface (the method’s parameter types and return type), and the enclosing class’ name. Even when such tokens are not there, +MNire uses the contexts to predict the tokens due to the high likelihoods of their co-occurrences. Our unique idea is to treat +the name generation as an abstract summarization on the tokens collected from the names of the program entities in the three +above contexts.

+ +

We conducted several experiments to evaluate MNire in method name consistency checking and in method name +recommending on large datasets with +14M methods. In detecting inconsistency method names, MNire improves the state-of-the-art +approach by 10.4% and 11% relatively in recall and precision, respectively. In method name recommendation, MNire improves relatively +over the state-of-the-art technique, code2vec, in both recall (18.2% higher) and precision (11.1% higher). To assess MNire’s usefulness, +we used it to detect inconsistent methods and suggest new names in several active, GitHub projects. We made 50 pull requests (PRs) and received +42 responses. Among them, five PRs were merged into the main branch, and 13 were approved for later merging. In total, in 31/42 cases, +the developer teams agree that our suggested names are more meaningful than the current names, showing MNire’s usefulness.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Impact of Evaluation Methodologies on Code Summarization

+

Pengyu Nie, Jiyang Zhang, Junyi Jessy Li, Raymond J. Mooney, Milos Gligoric. ACL 2021

+

+ + [ArXiV] + + + +
+ + evaluation + + dataset + +

+

There has been a growing interest in developing machine learning (ML) models for code summarization tasks, e.g., comment generation and method naming. Despite substantial increase in the effectiveness of ML models, the evaluation methodologies, i.e., the way people split datasets into training, validation, and test sets, were not well studied. Specifically, no prior work on code summarization considered the timestamps of code and comments during evaluation. This may lead to evaluations that are inconsistent with the intended use cases. In this paper, we introduce the time-segmented evaluation methodology, which is novel to the code summarization research community, and compare it with the mixed-project and cross-project methodologies that have been commonly used. Each methodology can be mapped to some use cases, and the time-segmented methodology should be adopted in the evaluation of ML models for code summarization. To assess the impact of methodologies, we collect a dataset of (code, comment) pairs with timestamps to train and evaluate several recent ML models for code summarization. Our experiments show that different methodologies lead to conflicting evaluation results. We invite the community to expand the set of methodologies used in evaluations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Conversational Paradigm for Program Synthesis

+

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, Caiming Xiong. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + synthesis + +

+

Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI’s Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

+

Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, Yingbo Zhou. 2023

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly.

+ +

In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and, (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a “free lunch” hypothesis. For data distributions, the effect of a mixture distribution of programming and natural languages on model performance is explored.

+ +

We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into four lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and, 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen2

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DIRECT : A Transformer-based Model for Decompiled Identifier Renaming

+

Vikram Nitin, Anthony Saieva, Baishakhi Ray, Gail Kaiser. NLP4Prog 2021

+

+ + [PDF] + + + +
+ + Transformer + + decompilation + +

+

Decompiling binary executables to high-level code is an important step in reverse engineering scenarios, such as malware analysis and legacy code maintenance. However, the generated high-level code is difficult to understand since the original variable names are lost. In this paper, we leverage transformer models to reconstruct the original variable names from decompiled code. Inherent differences between code and natural language present certain challenges in applying conventional transformer-based architectures to variable name recovery. We propose DIRECT, a novel transformer-based architecture customized specifically for the task at hand. We evaluate our model on a dataset of decompiled functions and find that DIRECT outperforms the previous state-of-the-art model by up to 20%. We also present ablation studies evaluating the impact of each of our modifications. We make the source code of DIRECT available to encourage reproducible research.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations

+

Changan Niu, Chuanyi Li, Vincent Ng, Jidong Ge, Liguo Huang, Bin Luo. ICSE 2022

+

+ + [ArXiV] + + [code] + + + +
+ + Transformer + + representation + +

+

Recent years have seen the successful application of large pre-trained modelsto code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding theirapplication to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existingpre-training tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstreamt asks. Experimental results demonstrate that SPT-Code achieves state-of-the-artperformance on five code-related downstream tasks after fine-tuning.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Program Synthesis with Large Language Models

+

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, Charles Sutton. 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + synthesis + +

+

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model’s ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model’s initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Show Your Work: Scratchpads for Intermediate Computation with Language Models

+

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena. 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + execution + +

+

Large pre-trained language models perform remarkably well on tasks that can be done “in one pass”, such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations – even in the few-shot regime – when asked to perform the operation “step by step”, showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a “scratchpad”. On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation

+

Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura. ASE 2015

+

+ + + +
+ + representation + + bimodal + + grammar + +

+

Pseudo-code written in natural language can aid +the comprehension of source code in unfamiliar programming +languages. However, the great majority of source code has no +corresponding pseudo-code, because pseudo-code is redundant +and laborious to create. If pseudo-code could be generated +automatically and instantly from given source code, we could +allow for on-demand production of pseudo-code without human +effort. In this paper, we propose a method to automatically +generate pseudo-code from source code, specifically adopting the +statistical machine translation (SMT) framework. SMT, which +was originally designed to translate between two natural languages, allows us to automatically learn the relationship between +source code/pseudo-code pairs, making it possible to create a +pseudo-code generator with less human effort. In experiments, +we generated English or Japanese pseudo-code from Python +statements using SMT, and find that the generated pseudo-code +is largely accurate, and aids code understanding.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning a Strategy for Adapting a Program Analysis via Bayesian Optimisation

+

Hakjoo Oh, Hongseok Yang, Kwangkeun Yi.. OOPSLA 2015

+

+ + + +
+ + program analysis + +

+

Building a cost-effective static analyser for real-world programs is still regarded an art. One key contributor to this +grim reputation is the difficulty in balancing the cost and the +precision of an analyser. An ideal analyser should be adap- +tive to a given analysis task, and avoid using techniques that +unnecessarily improve precision and increase analysis cost. +However, achieving this ideal is highly nontrivial, and it requires a large amount of engineering efforts.

+ +

In this paper we present a new approach for building +an adaptive static analyser. In our approach, the analyser +includes a sophisticated parameterised strategy that decides, for each part of a given program, whether to apply +a precision-improving technique to that part or not. We +present a method for learning a good parameter for such +a strategy from an existing codebase via Bayesian optimisation. The learnt strategy is then used for new, unseen programs. Using our approach, we developed partially flow- +and context-sensitive variants of a realistic C static analyser. +The experimental results demonstrate that using Bayesian +optimisation is crucial for learning from an existing codebase. Also, they show that among all program queries that +require flow- or context-sensitivity, our partially flow- and +context-sensitive analysis answers the 75% of them, while +increasing the analysis cost only by 3.3x of the baseline +flow- and context-insensitive analysis, rather than 40x or +more of the fully sensitive version.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Demystifying GPT Self-Repair for Code Generation

+

Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, Armando Solar-Lezama. 2023

+

+ + [ArXiV] + + + +
+ + repair + +

+

Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair – in which the model debugs and fixes mistakes in its own code – has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4’s ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Structured Statistical Syntax Tree Prediction

+

Cyrus Omar. SPLASH 2013

+

+ + + +
+ + language model + + grammar + +

+

Statistical models of source code can be used to improve +code completion systems, assistive interfaces, and code +compression engines. We are developing a statistical model +where programs are represented as syntax trees, rather than +simply a stream of tokens. Our model, initially for the Java +language, combines corpus data with information about syntax, types and the program context. We tested this model +using open source code corpuses and find that our model +is significantly more accurate than the current state of the +art, providing initial evidence for our claim that combining +structural and statistical information is a fruitful strategy.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation

+

Gabriel Orlanski, Alex Gittens. NLP4Prog 2021

+

+ + [PDF] + + + +
+ + dataset + + Transformer + +

+

Answering a programming question with only its title is difficult as salient contextual information is left out. To address this, we present a corpus of over 40,000 StackOverflow question texts to be used in conjunction with the corresponding intents from the CoNaLa dataset (Yin et al., 2018). Using both the intent and the question body, we use BART to establish a baseline BLEU score of 34.35 for this new task. We then find further improvements of 2.8% by combining the mined CoNaLa data with the labeled data to achieve a 35.32 BLEU score. We then evaluate the prior state-of-the-art CoNaLa models with this additional data. We find that our proposed method of using the body and mined data beats that of the previous state-of-the-art by a 71.96% BLEU score. Finally, we perform ablations that prove that BART is an unsupervised multimodal learner and examine its extractive behavior.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Deep Learning Approach to Identifying Source Code in Images and Video

+

Jordan Ott, Abigail Atchison, Paul Harnack, Adrienne Bergh, Erik Linstead.. MSR 2018

+

+ + + +
+ + information extraction + +

+

While substantial progress has been made in mining code on an +Internet scale, efforts to date have been overwhelmingly focused on +data sets where source code is represented natively as text. Large +volumes of source code available online and embedded in technical +videos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing +approaches to code extraction and indexing in this environment rely +heavily on computationally intense optical character recognition. +To improve the ease and efficiency of identifying this embedded +code, as well as identifying similar code examples, we develop a +deep learning solution based on convolutional neural networks and +autoencoders. Focusing on Java for proof of concept, our technique +is able to identify the presence of typeset and handwritten source +code in thousands of video images with 85.6%-98.6% accuracy based +on syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides +a more scalable basis for video indexing that can be incorporated +into existing software search and mining tools.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints

+

Irene Vlassi Pandi, Earl T. Barr, Andrew D. Gordon, Charles Sutton. 2020

+

+ + [ArXiV] + + + +
+ + types + + bimodal + +

+

We present a new approach to the type inference problem for dynamic languages. Our goal is to combine logical constraints, that is, deterministic information from a type system, with natural constraints, uncertain information about types from sources like identifier names. To this end, we introduce a framework for probabilistic type inference that combines logic and learning: logical constraints on the types are extracted from the program, and deep learning is applied to predict types from surface-level code properties that are statistically associated, such as variable names. The main insight of our method is to constrain the predictions from the learning procedure to respect the logical constraints, which we achieve by relaxing the logical inference problem of type prediction into a continuous optimisation problem. To evaluate the idea, we built a tool called OptTyper to predict a TypeScript declaration file for a JavaScript library. OptTyper combines a continuous interpretation of logical constraints derived by a simple program transformation and static analysis of the JavaScript code, with natural constraints obtained from a deep learning model, which learns naming conventions for types from a large codebase. We evaluate OptTyper on a data set of 5,800 open-source JavaScript projects that have type annotations in the well-known DefinitelyTyped repository. We find that combining logical and natural constraints yields a large improvement in performance over either kind of information individually, and produces 50% fewer incorrect type predictions than previous approaches.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Associating Natural Language Comment and Source Code Entities

+

Sheena Panthaplackel, Milos Gligoric, Raymond J. Mooney, Junyi Jessy Li. AAAI 2020

+

+ + [ArXiV] + + + +
+ + dataset + + bimodal + +

+

Comments are an integral part of software development; they are natural language descriptions associated with source code elements. Understanding explicit associations can be useful in improving code comprehensibility and maintaining the consistency between code and comments. As an initial step towards this larger goal, we address the task of associating entities in Javadoc comments with elements in Java source code. We propose an approach for automatically extracting supervised data using revision histories of open source projects and present a manually annotated evaluation dataset for this task. We develop a binary classifier and a sequence labeling model by crafting a rich feature set which encompasses various aspects of code, comments, and the relationships between them. Experiments show that our systems outperform several baselines learning from the proposed supervision.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Copy that! Editing Sequences by Copying Spans

+

Sheena Panthaplackel, Miltiadis Allamanis, Marc Brockschmidt. 2020

+

+ + [ArXiV] + + + +
+ + edit + +

+

Neural sequence-to-sequence models are finding increasing use in editing of documents, for example in correcting a text document or repairing source code. In this paper, we argue that common seq2seq models (with a facility to copy single tokens) are not a natural fit for such tasks, as they have to explicitly copy each unchanged token. We present an extension of seq2seq models capable of copying entire spans of the input to the output in one step, greatly reducing the number of decisions required during inference. This extension means that there are now many ways of generating the same output, which we handle by deriving a new objective for training and a variation of beam search for inference that explicitly handle this problem.

+ +

In our experiments on a range of editing tasks of natural language and source code, we show that our new model consistently outperforms simpler baselines.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Just-In-Time Inconsistency Detection Between Comments and Source Code

+

Sheena Panthaplackel, Junyi Jessy Li, Milos Gligoric, Raymond J. Mooney. 2020

+

+ + [ArXiV] + + + +
+ + edit + + bimodal + + documentation + +

+

Natural language comments convey key aspects of source code such as implementation, usage, and pre- and post-conditions. Failure to update comments accordingly when the corresponding code is modified introduces inconsistencies, which is known to lead to confusion and software bugs. In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code, in order to catch potential inconsistencies just-in-time, i.e., before they are committed to a version control system. To achieve this, we develop a deep-learning approach that learns to correlate a comment with code changes. By evaluating on a large corpus of comment/code pairs spanning various comment types, we show that our model outperforms multiple baselines by significant margins. For extrinsic evaluation, we show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system which can both detect and resolve inconsistent comments based on code changes.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Update Natural Language Comments Based on Code Changes

+

Sheena Panthaplackel, Pengyu Nie, Milos Gligoric, Raymond J. Mooney, Junyi Jessy Li. ACL 2020

+

+ + [ArXiV] + + + +
+ + bimodal + + edit + + documentation + +

+

We formulate the novel task of automatically updating an existing natural language comment based on changes in the body of code it accompanies. We propose an approach that learns to correlate changes across two distinct language representations, to generate a sequence of edits that are applied to the existing comment to reflect the source code modifications. We train and evaluate our model using a dataset that we collected from commit histories of open-source software projects, with each example consisting of a concurrent update to a method and its corresponding comment. We compare our approach against multiple baselines using both automatic metrics and human evaluation. Results reflect the challenge of this task and that our model outperforms baselines with respect to making edits.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Describe Solutions for Bug Reports Based on Developer Discussions

+

Sheena Panthaplackel, Junyi Jessy Li, Milos Gligoric, Raymond J. Mooney. 2021

+

+ + [ArXiV] + + + +
+ + summarization + + documentation + +

+

When a software bug is reported, developers engage in a discussion to collaboratively resolve it. While the solution is likely formulated within the discussion, it is often buried in a large amount of text, making it difficult to comprehend, which delays its implementation. To expedite bug resolution, we propose generating a concise natural language description of the solution by synthesizing relevant content within the discussion, which encompasses both natural language and source code. Furthermore, to support generating an informative description during an ongoing discussion, we propose a secondary task of determining when sufficient context about the solution emerges in real-time. We construct a dataset for these tasks with a novel technique for obtaining noisy supervision from repository changes linked to bug reports. We establish baselines for generating solution descriptions, and develop a classifier which makes a prediction following each new utterance on whether or not the necessary context for performing generation is available. Through automated and human evaluation, we find these tasks to form an ideal testbed for complex reasoning in long, bimodal dialogue context.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Using Developer Discussions to Guide Fixing Bugs in Software

+

Sheena Panthaplackel, Milos Gligoric, Junyi Jessy Li, Raymond J. Mooney. EMNLP 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + repair + +

+

Automatically fixing software bugs is a challenging task. While recent work showed that natural language context is useful in guiding bug-fixing models, the approach required prompting developers to provide this context, which was simulated through commit messages written after the bug-fixing code changes were made. We instead propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for any additional information from developers. For this, we augment standard bug-fixing datasets with bug report discussions. Using these newly compiled datasets, we demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers

+

Emanuele Parisi, Francesco Barchi, Andrea Bartolini, Giuseppe Tagliavini, Andrea Acquaviva. DATE 2021

+

+ + [IEEE] + + [ArXiV] + + + +
+ + optimization + + program analysis + +

+

The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption. Depending on the kernel to be executed, the energy optimal scaling configuration is not trivial. While recent work has focused on general-purpose systems to learn and predict the best execution target in terms of the execution time of a snippet of code or kernel (e.g. offload OpenCL kernel on multicore CPU or GPU), in this work we focus on static compile-time features to assess if they can be successfully used to predict the minimum energy configuration on PULP, an ultra-low-power architecture featuring an on-chip cluster of RISC-V processors. Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Making the Most of Scarce Input Data in Deep Learning-Based Source Code Classification for Heterogeneous Device Mapping

+

Emanuele Parisi, Francesco Barchi, Andrea Bartolini, Andrea Acquaviva. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 2022

+

+ + [IEEE] + + [code] + + + +
+ + optimization + + program analysis + + static analysis + + language model + +

+

Despite its relatively recent history, deep learning (DL)-based source code analysis is already a cornerstone in machine learning for compiler optimization. When applied to the classification of pieces of code to identify the best computational unit in a heterogeneous Systems-on-Chip, it can be effective in supporting decisions that a programmer has otherwise to take manually. Several techniques have been proposed exploiting different networks and input information, prominently sequence-based and graph-based representations, complemented by auxiliary information typically related to payload and device configuration. While the accuracy of DL methods strongly depends on the training and test datasets, so far no exhaustive and statistically meaningful analysis has been done on its impact on the results and on how to effectively extract the available information. This is relevant also considering the scarce availability of source code datasets that can be labeled by profiling on heterogeneous compute units. In this article, we first present such a study, which leads us to devise the contribution of code sequences and auxiliary inputs separately. Starting from this analysis, we then demonstrate that by using the normalization of auxiliary information, it is possible to improve state-of-the-art results in terms of accuracy. Finally, we propose a novel approach exploiting Siamese networks that further improve mapping accuracy by increasing the cardinality of the dataset, thus compensating for its relatively small size.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Building Language Models for Text with Named Entities

+

M.R. Parvez, Saikat Chakraborty, Baishakhi Ray, KW Chang. ACL 2018

+

+ + + +
+ + language model + +

+

Text in many domains involves a significant amount of named entities. Predicting the entity names is often challenging +for a language model as they appear less +frequent on the training corpus. In this +paper, we propose a novel and effective +approach to building a discriminative language model which can learn the entity +names by leveraging their entity type information. We also introduce two benchmark datasets based on recipes and Java +programming codes, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2% +better perplexity in recipe generation and +22.06% on code generation than the state-of-the-art language models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Retrieval Augmented Code Generation and Summarization

+

Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. EMNLP-Findings 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + summarization + + code generation + +

+

Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers’ code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER has a couple of uniqueness. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation

+

Pardis Pashakhanloo, Aaditya Naik, Yuepeng Wang, Hanjun Dai, Petros Maniatis, Mayur Naik. ICLR 2022

+

+ + [OpenReview] + + + +
+ + representation + + variable misuse + +

+

Designing a suitable representation for code-reasoning tasks is challenging in aspects such as the kinds of program information to model, how to combine them, and how much context to consider. We propose CodeTrek, a deep learning approach that addresses these challenges by representing codebases as databases that conform to rich relational schemas. The relational representation not only allows CodeTrek to uniformly represent diverse kinds of program information, but also to leverage program-analysis queries to derive new semantic relations, which can be readily incorporated without further architectural engineering. CodeTrek embeds this relational representation using a set of walks that can traverse different relations in an unconstrained fashion, and incorporates all relevant attributes along the way. We evaluate CodeTrek on four diverse and challenging Python tasks: variable misuse, exception prediction, unused definition, and variable shadowing. +CodeTrek achieves an accuracy of 91%, 63%, 98%, and 94% on these tasks respectively, and outperforms state-of-the-art neural models by 2-19% points.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Exploring Dimensions of Generalizability and Few-shot Transfer for Text-to-SQL Semantic Parsing

+

Rajaswa Patil, Manasi Patwardhan, Shirish Karande, Lovekesh Vig, Gautam Shroff. The 1st Transfer Learning for Natural Language Processing Workshop (TL4NLP 2022) 2022

+

+ + [PDF] + + [Data] + + + +
+ + dataset + + evaluation + + Transformer + + benchmark + + generalizability + +

+

Existing work on generalization in Text-to-SQL semantic parsing has been restricted to a zero-shot cross-domain setting. In this paper, we introduce Spider-Gen: a Text-to-SQL benchmark to develop a paradigm of transfer learning across distinct dimensions of generalization in Text-to-SQL semantic parsing. The Spider-Gen benchmark focuses on few-shot adaption for Cross-domain, Lexical, and Structural generalization of Text-to-SQL models. Through our experiments with the Spider-Gen dataset, we show that Seq2Seq language models struggle to generalize against change in data distribution, lexical changes in database schema, and changes in SQL query complexity. Our experiments also reveal that performing few-shot fine-tuning helps Text-to-SQL models to generalize across these changes. However, such few-shot adaptation comes with a negative effect on the knowledge learnt during training. Hence, we also explore Parameter-efficient Fine-tuning methods to overcome the limitations of Seq2Seq Text-to-SQL models. We release the Spider-Gen dataset publicly to facilitate further research in generalization and transfer learning across various dimensions in Text-to-SQL semantic parsing.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic, Generative Models of Input Data

+

Jibesh Patra, Michael Pradel. 2016

+

+ + + +
+ + fuzzing + +

+

Fuzzing is a popular technique to create test inputs for software that processes structured data. It has been successfully +applied in various domains, ranging from compilers and interpreters over program analyses to rendering engines, image manipulation tools, and word processors. Existing fuzz +testing techniques are tailored for a particular purpose and +rely on a carefully crafted model of the data to be generated. +This paper presents TreeFuzz, a generic approach for generating structured data without an a priori known model. The +key idea is to exploit a given corpus of example data to au- +tomatically infer probabilistic, generative models that create +new data with properties similar to the corpus. To support a +wide range of different properties, TreeFuzz is designed as a +framework with an extensible set of techniques to infer generative models. We apply the idea to JavaScript programs +and HTML documents and show that the approach generates mostly valid data for both of them: 96.3% of the generated JavaScript programs are syntactically valid and there are +only 2.06 validation errors per kilobyte of generated HTML. +The performance of both learning and generation scales linearly w.r.t. the size of the corpus. Using TreeFuzz-generated +JavaScript programs for differential testing of JavaScript engines exposes various inconsistencies among browsers, including browser bugs and unimplemented language features.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Semantic Bug Seeding: A Learning-Based Approach for Creating Realistic Bugs

+

Jibesh Patra, Michael Pradel. FSE 2021

+

+ + + +
+ + repair + + edit + +

+

When working on techniques to address the wide-spread problem +of software bugs, one often faces the need for a large number of +realistic bugs in real-world programs. Such bugs can either help +evaluate an approach, e.g., in form of a bug benchmark or a suite +of program mutations, or even help build the technique, e.g., in +learning-based bug detection. Because gathering a large number ofreal bugs is difficult, +a common approach is to rely on automatically +seeded bugs. Prior work seeds bugs based on syntactic transformation patterns, +which often results in unrealistic bugs and typically +cannot introduce new, application-specific code tokens. This paper +presents SemSeed, a technique for automatically seeding bugs in +a semantics-aware way. The key idea is to imitate how a given +real-world bug would look like in other programs by semantically +adapting the bug pattern to the local context. To reason about the +semantics of pieces of code, our approach builds on learned token embeddings +that encode the semantic similarities of identifiers and literals. Our +evaluation with real-world JavaScript softwares +hows that the approach effectively reproduces real bugs and clearly +outperforms a semantics-unaware approach. The seeded bugs are +useful as training data for learning-based bug detection, where +they significantly improve the bug detection ability. Moreover, we +show that SemSeed-created bugs complement existing mutation +testing operators, and that our approach is efficient enough to seed +hundreds of thousands of bugs within an hour.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions

+

Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri. 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + language model + +

+

There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described `AI pair programmer’, GitHub Copilot, a language model trained over open-source GitHub code. However, code often contains bugs - and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns on the security of Copilot’s code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g. those from MITRE’s “Top 25” list). We explore Copilot’s performance on three distinct code generation axes – examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, producing 1,692 programs. Of these, we found approximately 40% to be vulnerable.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

How could Neural Networks understand Programs?

+

Dinglan Peng, Shuxin Zheng, Yatao Li, Guolin Ke, Di He, Tie-Yan Liu. ICML 2021

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

Semantic understanding of programs is a fundamental problem for programming language processing (PLP). Recent works that learn representations of code based on pre-training techniques in NLP have pushed the frontiers in this direction. However, the semantics of PL and NL have essential differences. These being ignored, we believe it is difficult to build a model to better understand programs, by either directly applying off-the-shelf NLP pre-training techniques to the source code, or adding features to the model by the heuristic. In fact, the semantics of a program can be rigorously defined by formal semantics in PL theory. For example, the operational semantics, describes the meaning of a valid program as updating the environment (i.e., the memory address-value function) through fundamental operations, such as memory I/O and conditional branching. Inspired by this, we propose a novel program semantics learning paradigm, that the model should learn from information composed of (1) the representations which align well with the fundamental operations in operational semantics, and (2) the information of environment transition, which is indispensable for program understanding. To validate our proposal, we present a hierarchical Transformer-based pre-training model called OSCAR to better facilitate the understanding of programs. OSCAR learns from intermediate representation (IR) and an encoded representation derived from static analysis, which are used for representing the fundamental operations and approximating the environment transitions respectively. OSCAR empirically shows the outstanding capability of program semantics understanding on many practical software engineering tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Generative Type Inference for Python

+

Yun Peng, Chaozheng Wang, Wenxuan Wang, Cuiyun Gao, Michael R. Lyu. 2023

+

+ + [ArXiV] + + + +
+ + types + +

+

Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. However, its dynamic type system can lead to potential type errors, leading researchers to explore automatic type inference approaches for Python programs. The rule-based type inference approaches can ensure the accuracy of predicted variable types, but they suffer from low coverage problems. Supervised type inference approaches, while feature-agnostic, require large, high-quality annotated datasets and are limited to pre-defined types. As zero-shot approaches, the cloze-style approaches reformulate the type inference problem into a fill-in-the-blank problem. However, their performance is limited. This paper introduces TypeGen, a few-shot generative type inference approach that incorporates static domain knowledge from static analysis. TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis into prompts based on the type dependency graphs (TDGs), enabling language models to learn from how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations. TypeGen only requires very few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of results through the use of the input-explanation-output strategy. Experiments show that TypeGen outperforms the best baseline Type4Py by 10.0% for argument type prediction and 22.5% in return value type prediction in terms of top-1 Exact Match by using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% compared to the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B in terms of top-1 Exact Match.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CoTexT: Multi-task Learning with Code-Text Transformer

+

Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, Yanfang Ye. NLP4Prog 2021

+

+ + [ArXiV] + + [PDF] + + + +
+ + Transformer + +

+

We present CoTexT, a transformer-based architecture encoder-decoder pre-trained model that learns the representative context between natural language (NL) and programming language (PL) through multi-task learning. CoTexT is pre-trained, in self-supervised fashion, based on large programming language corpus to learn general-purpose understanding and code-text generation supporting downstream NL-PL task such as code summarizing/documentation, code generation, defect detection, code debugging, etc. We train CoTexT on different combination of available PL corpus including both “bimodal” and “unimodal” data where the former is the combinations of both natural texts and their corresponding code snippets in an input sequence and the latter is merely code snippets. We evaluate multi-task learning CoTexT on different generation and classification tasks on CodeXGLUE and it achieves state-of-the-art on all downstream tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Program Embeddings to Propagate Feedback on Student Code

+

Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, Leonidas Guibas. ICML 2015

+

+ + + +
+ + representation + + repair + + education + +

+

Providing feedback, both assessing final work +and giving hints to stuck students, is difficult +for open-ended assignments in massive online +classes which can range from thousands to millions of students. We introduce a neural network +method to encode programs as a linear mapping +from an embedded precondition space to an embedded postcondition space and propose an algorithm for feedback at scale using these linear maps as features. We apply our algorithm +to assessments from the Code.org Hour of Code +and Stanford University’s CS1 course, where we +propagate human comments on student assignments to orders of magnitude more submissions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Synchromesh: Reliable code generation from pre-trained language models

+

Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, Sumit Gulwani. ICLR 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + language model + +

+

Large pre-trained language models have been used to generate code,providing a flexible interface for synthesizing programs from natural language specifications. However, they often violate syntactic and semantic rules of their output language, limiting their practical usability. In this paper, we propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation. Synchromesh comprises two components. First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection. TST learns to recognize utterances that describe similar target programs despite differences in surface natural language features. Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD): a general framework for constraining the output to a set of valid programs in the target language. CSD leverages constraints on partial outputs to sample complete correct programs, and needs neither re-training nor fine-tuning of the language model. We evaluate our methods by synthesizing code from natural language descriptions using GPT-3 and Codex in three real-world languages: SQL queries, Vega-Lite visualizations and SMCalFlow programs. These domains showcase rich constraints that CSD is able to enforce, including syntax, scope, typing rules, and contextual logic. We observe substantial complementary gains from CSD and TST in prediction accuracy and in effectively preventing run-time errors.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Time-Efficient Code Completion Model for the R Programming Language

+

Artem Popov, Dmitrii Orekhov, Denis Litvinov, Nikolay Korolev, Gleb Morgachev. NLP4Prog 2021

+

+ + [PDF] + + + +
+ + dataset + + language model + + code generation + + Transformer + +

+

In this paper we present a deep learning code completion model for the R language. We introduce several techniques to utilize language modeling based architecture in the code completion task. With these techniques, the model requires low resources, but still achieves high quality. We also present an evaluation dataset for the R language completion task. Our dataset contains multiple autocompletion usage contexts that provides robust validation results. The dataset is publicly available.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Learning to Find Bugs

+

Michael Pradel, Koushik Sen. 2017

+

+ + [PDF] + + + +
+ + defect + + program analysis + +

+

Automated bug detection, e.g., through pattern-based static +analysis, is an increasingly popular technique to find programming errors and other code quality issues. Traditionally, +bug detectors are program analyses that are manually written and carefully tuned by an analysis expert. Unfortunately, +the huge amount of possible bug patterns makes it difficult +to cover more than a small fraction of all bugs. This paper +presents a new approach toward creating bug detectors. The +basic idea is to replace manually writing a program analysis +with training a machine learning model that distinguishes +buggy from non-buggy code. To address the challenge that +effective learning requires both positive and negative train- +ing examples, we use simple code transformations that create likely incorrect code from existing code examples. We +present a general framework, called DeepBugs, that extracts +positive training examples from a code corpus, leverages +simple program transformations to create negative training +examples, trains a model to distinguish these two, and then +uses the trained model for identifying programming mistakes in previously unseen code. As a proof of concept, we +create four bug detectors for JavaScript that find a diverse set +of programming mistakes, e.g., accidentally swapped function arguments, incorrect assignments, and incorrect binary +operations. To find bugs, the trained models use information +that is usually discarded by program analyses, such as identifier names of variables and functions. Applying the approach +to a corpus of 150,000 JavaScript files shows that learned bug +detectors have a high accuracy, are very efficient, and reveal +132 programming mistakes in real-world code.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

TypeWriter: Neural Type Prediction with Search-based Validation

+

Michael Pradel, Georgios Gousios, Jason Liu, Satish Chandra.. 2019

+

+ + [ArXiV] + + + +
+ + types + + bimodal + +

+

Maintaining large code bases written in dynamically typed languages, such as JavaScript or Python, can be challenging: simple data compatibility errors proliferate, IDE support is lacking and APIs are harder to comprehend. Recent work attempts to address those issues through either static analysis or probabilistic type inference. Unfortunately, static type inference for dynamic languages is inherently limited, while probabilistic approaches suffer from imprecision. This paper presents TypeWriter, the first combination of probabilistic prediction with search-based refinement of predicted types. TypeWriter’s predictor learns to infer the return and argument types for functions from partially annotated code bases by combining the natural language properties of code with programming language-level information. To validate predicted types, TypeWriter invokes a gradual type checker with different combinations of the predicted types, while navigating the space of possible type combinations in a feedback-directed manner. We implement the TypeWriter approach for Python and evaluate it on two code corpora: a multi-million line code base at Facebook and a collection of 500 popular open-source projects. We show that TypeWriter’s type predictor achieves a precision of 64% (91%) and a recall of 52% (68%) in the top-1 (top-5) predictions, and demonstrate that usage contexts are a helpful addition to neural type predictors. By combining predictions with search-based validation, TypeWriter can fully annotate between 42% to 64% of the files in a randomly selected corpus, while ensuring type correctness. A comparison with a static type inference tool shows that TypeWriter adds many more non-trivial types. Overall, TypeWriter provides developers with an effective way to help with the transition to fully type-annotated code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Software Analysis

+

Michael Pradel, Satish Chandra. 2020

+

+ + [ArXiV] + + + +
+ + program analysis + + survey + +

+

Many software development problems can be addressed by program analysis tools, which traditionally are based on precise, logical reasoning and heuristics to ensure that the tools are practical. Recent work has shown tremendous success through an alternative way of creating developer tools, which we call neural software analysis. The key idea is to train a neural machine learning model on numerous code examples, which, once trained, makes predictions about previously unseen code. In contrast to traditional program analysis, neural software analysis naturally handles fuzzy information, such as coding conventions and natural language embedded in code, without relying on manually encoded heuristics. This article gives an overview of neural software analysis, discusses when to (not) use it, and presents three example analyses. The analyses address challenging software development problems: bug detection, type prediction, and code completion. The resulting tools complement and outperform traditional program analyses, and are used in industrial practice.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Unsupervised Learning of General-Purpose Embeddings for Code Changes

+

Mikhail Pravilov, Egor Bogomolov, Yaroslav Golubev, Timofey Bryksin. 2021

+

+ + [ArXiV] + + + +
+ + edit + + representation + +

+

Applying machine learning to tasks that operate with code changes requires their numerical representation. In this work, we propose an approach for obtaining such representations during pre-training and evaluate them on two different downstream tasks - applying changes to code and commit message generation. During pre-training, the model learns to apply the given code change in a correct way. This task requires only code changes themselves, which makes it unsupervised. In the task of applying code changes, our model outperforms baseline models by 5.9 percentage points in accuracy. As for the commit message generation, our model demonstrated the same results as supervised models trained for this specific task, which indicates that it can encode code changes well and can be improved in the future by pre-training on a larger dataset of easily gathered code changes.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Intelligent Code Completion with Bayesian Networks

+

Sebastian Proksch, Johannes Lerch, Mira Mezini. TSE 2015

+

+ + + +
+ + autocomplete + +

+

Code completion is an integral part of modern Integrated Development Environments (IDEs). Developers +often use it to explore Application Programming Interfaces (APIs). It is also useful to reduce the required +amount of typing and to help avoid typos. Traditional code completion systems propose all type-correct +methods to the developer. Such a list is often very long with many irrelevant items. More intelligent code +completion systems have been proposed in prior work to reduce the list of proposed methods to relevant +items.

+ +

This work extends one of these existing approaches, the Best Matching Neighbor (BMN) algorithm. We +introduce Bayesian networks as an alternative underlying model, use additional context information for +more precise recommendations, and apply clustering techniques to improve model sizes. We compare our +new approach, Pattern-based Bayesian Networks (PBN), to the existing BMN algorithm. We extend previously used evaluation methodologies and, in addition to prediction quality, we also evaluate model size and +inference speed.

+ +

Our results show that the additional context information we collect improves prediction quality, especially +for queries that do not contain method calls. We also show that PBN can obtain comparable prediction +quality to BMN, while model size and inference speed scale better with large input sizes.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

sk_p: a neural program corrector for MOOCs

+

Yewen Pu, Karthik Narasimhan, Armando Solar-Lezama, Regina Barzilay. SPLASH 2016

+

+ + + +
+ + repair + +

+

We present a novel technique for automatic program correction in MOOCs, capable of fixing both syntactic and semantic errors without manual, problem specific correction strategies. Given an incorrect student program, it generates candidate programs from a distribution of likely corrections, and checks each candidate for correctness against a test suite.

+ +

The key observation is that in MOOCs many programs share similar code fragments, and the seq2seq neural network model, used in the natural-language processing task of machine translation, can be modified and trained to recover these fragments.

+ +

Experiment shows our scheme can correct 29% of all incorrect submissions and out-performs state of the art approach which requires manual, problem specific correction strategies.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

+

Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladmir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Ulrich Finkler. 2021

+

+ + [GitHub] + + + +
+ + dataset + +

+

Advancements in deep learning and machine learning algorithms have enabled +breakthrough progress in computer vision, speech recognition, natural language +processing and beyond. In addition, over the last several decades, software has +been built into the fabric of every aspect of our society. Together, these two +trends have generated new interest in the fast-emerging research area of “AI for +Code”. As software development becomes ubiquitous across all industries and code +infrastructure of enterprise legacy applications ages, it is more critical than ever +to increase software development productivity and modernize legacy applications. +Over the last decade, datasets like ImageNet, with its large scale and diversity, +have played a pivotal role in algorithmic advancements from computer vision to +language and speech understanding. In this paper, we present “Project CodeNet”, +a first-of-its-kind, very large scale, diverse, and high-quality dataset to accelerate +the algorithmic advancements in AI for Code. It consists of 14M code samples +and about 500M lines of code in 55 different programming languages. Project +CodeNet is not only unique in its scale, but also in the diversity of coding tasks +it can help benchmark: from code similarity and classification for advances in +code recommendation algorithms, and code translation between a large variety +programming languages, to advances in code performance (both runtime, and +memory) improvement techniques. CodeNet also provides sample input and output +test sets for over 7M code samples, which can be critical for determining code +equivalence in different languages. As a usability feature, we provide several +preprocessing tools in Project CodeNet to transform source codes into representations +that can be readily used as inputs into machine learning models.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Testing Neural Program Analyzers

+

Md Rafiqul Islam Rabin, Ke Wang, Mohammad Amin Alipour. ASE (LBR-Track) 2019

+

+ + [ArXiV] + + [code] + + + +
+ + evaluation + + refactoring + +

+

Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some predictions about it, e.g., bug prediction. We call these models neural program analyzers. The reliability of neural programs can impact the reliability of the encompassing analyses. In this paper, we describe our ongoing efforts to develop effective techniques for testing neural programs. We discuss the challenges involved in developing such tools and our future plans. In our preliminary experiment on a neural model recently proposed in the literature, we found that the model is very brittle, and simple perturbations in the input can cause the model to make mistakes in its prediction.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Towards Demystifying Dimensions of Source Code Embeddings

+

Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour. RL+SE&PL (Co-located with ESEC/FSE) 2020

+

+ + [ArXiV] + + [code] + + + +
+ + evaluation + + representation + + naming + + interpretability + +

+

Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, there is little known about the contents of these vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with the handcrafted features. Our results suggest that the handcrafted features can perform very close to the highly-dimensional code2vec embeddings, and the information gains are more evenly distributed in the code2vec embeddings compared to the handcrafted features. We also find that the code2vec embeddings are more resilient to the removal of dimensions with low information gains than the handcrafted features. We hope our results serve a stepping stone toward principled analysis and evaluation of these code representations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations

+

Md Rafiqul Islam Rabin, Nghi D. Q. Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, Mohammad Amin Alipour. IST 2021

+

+ + [ArXiV] + + [code] + + + +
+ + evaluation + + adversarial + + generalizability + + refactoring + + summarization + +

+

With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees. On the positive side, we observe that as the size of the training dataset grows and diversifies the generalizability of correct predictions produced by the neural program models can be improved too. Our results on the generalizability of neural program models provide insights to measure their limitations and provide a stepping stone for their improvement.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Understanding Neural Code Intelligence Through Program Simplification

+

Md Rafiqul Islam Rabin, Vincent J. Hellendoorn, Mohammad Amin Alipour. ESEC/FSE 2021

+

+ + [ArXiV] + + [code] + + + +
+ + interpretability + + refactoring + + information extraction + +

+

A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of “transparent/interpretable-AI”. However, these approaches are often specific to a particular set of network architectures, even requiring access to the network’s parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND’s extracted features may help understand neural CI systems’ predictions and learned behavior.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Memorization and Generalization in Neural Code Intelligence Models

+

Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour, Vincent J. Hellendoorn. IST 2022

+

+ + [ArXiV] + + [code] + + + +
+ + evaluation + + memorization + + generalizability + + refactoring + + language model + +

+

Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight important risks: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed all models manifest some forms of memorization. This can be potentially troublesome in most code intelligence tasks where they rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models in code intelligence systems that are usually overlooked by software engineering researchers.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models

+

Md Rafiqul Islam Rabin, Aftab Hussain, Mohammad Amin Alipour. MAPS 2022

+

+ + [ArXiV] + + [code] + + + +
+ + interpretability + + refactoring + + adversarial + +

+

Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI models. However, this approach is syntax-unaware and does not consider the grammar of the programming language. In this paper, we apply a syntax-guided program reduction technique that considers the grammar of the input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique is faster and provides smaller sets of key tokens in reduced programs. We also show that the key tokens could be used in generating adversarial examples for up to 65% of the input programs.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Abstract Syntax Networks for Code Generation and Semantic Parsing

+

Maxim Rabinovich, Mitchell Stern, Dan Klein. ACL 2017

+

+ + [ArXiV] + + + +
+ + code generation + + grammar + +

+

Tasks like code generation and semantic parsing require mapping unstructured (or partially structured) inputs to well-formed, executable outputs. We introduce abstract syntax networks, a modeling framework for these problems. The outputs are represented as abstract syntax trees (ASTs) and constructed by a decoder with a dynamically-determined modular structure paralleling the structure of the output tree. On the benchmark Hearthstone dataset for code generation, our model obtains 79.2 BLEU and 22.7% exact match accuracy, compared to previous state-of-the-art values of 67.1 and 6.1%. Furthermore, we perform competitively on the Atis, Jobs, and Geo semantic parsing datasets with no task-specific engineering.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

User-guided program reasoning using Bayesian inference

+

Mukund Raghothaman, Sulekha Kulkarni, Kihong Heo, Mayur Naik. PLDI 2018

+

+ + [Paper] + + + +
+ + program analysis + +

+

Program analyses necessarily make approximations that often lead them to report true alarms interspersed with many false alarms. We propose a new approach to leverage user feedback to guide program analyses towards true alarms and away from false alarms. Our approach associates each alarm with a confidence value by performing Bayesian inference on a probabilistic model derived from the analysis rules. In each iteration, the user inspects the alarm with the highest confidence and labels its ground truth, and the approach recomputes the confidences of the remaining alarms given this feedback. It thereby maximizes the return on the effort by the user in inspecting each alarm. We have implemented our approach in a tool named Bingo for program analyses expressed in Datalog. Experiments with real users and two sophisticated analyses—a static datarace analysis for Java programs and a static taint analysis for Android apps—show significant improvements on a range of metrics, including false alarm rates and number of bugs found.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Natural Software Revisited

+

Musfiqur Rahman, Dharani Palani, Peter C. Rigby. ICSE 2019

+

+ + + +
+ +

+

Recent works have concluded that software is more repetitive and predictable, i.e. more natural, than English texts. These works included “simple/artificial” syntax rules in their language models. When we remove SyntaxTokens we find that code is still repetitive and predictable but only at levels slightly above English. Furthermore, previous works have compared individual Java programs to general English corpora, such as Gutenberg, which contains a historically large range of styles and subjects (e.g. Saint Augustine to Oscar Wilde). We perform an additional comparison of technical StackOverflow English discussions with source code and find that this restricted English is similarly repetitive to code. Although we find that code is less repetitive than previously thought, we suspect that API code element usage will be repetitive across software projects. For example a file is opened and closed in the same manner irrespective of domain. When we restrict our n-grams to those contained in the Java API we find that the entropy is significantly lower than the English corpora. Previous works have focused on sequential sequences of tokens. When we extract program graphs of size 2, 3, and 4 nodes we see that the abstract graph representation is much more concise and repetitive than the sequential representations of the same code. This suggests that future work should focus on statistical graph models that go beyond linear sequences of tokens. Our anonymous replication package makes our scripts and data available to future researchers and reviewers.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Backdoors in Neural Models of Source Code

+

Goutham Ramakrishnan, Aws Albarghouthi. ICPR 2022

+

+ + [IEEE] + + [ArXiV] + + [Code] + + + +
+ + adversarial + +

+

Deep neural networks are vulnerable to a range of adversaries. A particularly pernicious class of vulnerabilities are backdoors, where model predictions diverge in the presence of subtle triggers in inputs. An attacker can implant a backdoor by poisoning the training data to yield a desired target prediction on triggered inputs. We study backdoors in the context of deep-learning for source code. (1) We define a range of backdoor classes for source-code tasks and show how to poison a dataset to install such backdoors. (2) We adapt and improve recent algorithms from robust statistics for our setting, showing that backdoors leave a spectral signature in the learned representation of source code, thus enabling detection of poisoned data. (3) We conduct a thorough evaluation on different architectures and languages, showing the ease of injecting backdoors and our ability to eliminate them.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On the “Naturalness” of Buggy Code

+

Baishakhi Ray, Vincent Hellendoorn, Saheel Godhane, Zhaopeng Tu, Alberto Bacchelli, Premkumar Devanbu. ICSE 2015

+

+ + + +
+ + defect + +

+

Real software, the kind working programmers produce by the kLOC +to solve real-world problems, tends to be “natural”, like speech or +natural language; it tends to be highly repetitive and predictable. +Researchers have captured this naturalness of software through statistical models and used them to good effect in suggestion engines, +porting tools, coding standards checkers, and idiom miners. This +suggests that code that appears improbable, or surprising, to a good +statistical language model is “unnatural” in some sense, and thus +possibly suspicious. In this paper, we investigate this hypothesis. We consider a large corpus of bug fix commits (ca. 8,296), +from 10 different Java projects, and we focus on its language statistics, evaluating the naturalness of buggy code and the corresponding fixes. We find that code with bugs tends to be more entropic +(i.e. unnatural), becoming less so as bugs are fixed. Focusing on +highly entropic lines is similar in cost-effectiveness to some well-known static bug finders (PMD, FindBugs) and ordering warnings +from these bug finders using an entropy measure improves the cost-effectiveness of inspecting code implicated in warnings. This suggests that entropy may be a valid language-independent and simple +way to complement the effectiveness of PMD or FindBugs, and +that search-based bug-fixing methods may benefit from using entropy both for fault-localization and searching for fixes.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code Completion with Statistical Language Models

+

Veselin Raychev, Martin Vechev, Eran Yahav. PLDI 2014

+

+ + + +
+ + language model + + autocomplete + + code generation + +

+

We address the problem of synthesizing code completions for programs using APIs. Given a program with holes, we synthesize completions for holes with the most likely sequences of method calls.

+ +

Our main idea is to reduce the problem of code completion to +a natural-language processing problem of predicting probabilities +of sentences. We design a simple and scalable static analysis that +extracts sequences of method calls from a large codebase, and +index these into a statistical language model. We then employ +the language model to find the highest ranked sentences, and use +them to synthesize a code completion. Our approach is able to +synthesize sequences of calls across multiple objects together with +their arguments.

+ +

Experiments show that our approach is fast and effective. Virtually all computed completions typecheck, and the desired completion appears in the top 3 results in 90% of the cases.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Predicting Program Properties from “Big Code”

+

Veselin Raychev, Martin Vechev, Andreas Krause. POPL 2015

+

+ + + +
+ + program analysis + + naming + + types + + deobfuscation + +

+

We present a new approach for predicting program properties from +massive codebases (aka “Big Code”). Our approach first learns a +probabilistic model from existing data and then uses this model to +predict properties of new, unseen programs.

+ +

The key idea of our work is to transform the input program into +a representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic +graphical models such as conditional random fields (CRFs) in order +to perform joint prediction of program properties.

+ +

As an example of our approach, we built a scalable prediction +engine called JSNICE 1 for solving two kinds of problems in the +context of JavaScript: predicting (syntactic) names of identifiers +and predicting (semantic) type annotations of variables. Experimentally, JSNICE predicts correct names for 63% of name identifiers and its type annotation predictions are correct in 81% of the +cases. In the first week since its release, JSN ICE was used by more +than 30,000 developers and in only few months has become a popular tool in the JavaScript developer community.

+ +

By formulating the problem of inferring program properties as +structured prediction and showing how to perform both learning +and inference in this context, our work opens up new possibilities +for attacking a wide range of difficult problems in the context of +“Big Code” including invariant generation, de-compilation, synthesis and others.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Programs from Noisy Data

+

Veselin Raychev, Pavol lBielik, Martin Vechev, Andreas Krause. POPL 2016

+

+ + + +
+ + code generation + + grammar + +

+

We present a new approach for learning programs from noisy +datasets. Our approach is based on two new concepts: a regularized +program generator which produces a candidate program based on a +small sample of the entire dataset while avoiding overfitting, and a +dataset sampler which carefully samples the dataset by leveraging +the candidate program’s score on that dataset. The two components +are connected in a continuous feedback-directed loop.

+ +

We show how to apply this approach to two settings: one where +the dataset has a bound on the noise, and another without a noise +bound. The second setting leads to a new way of performing +approximate empirical risk minimization on hypotheses classes +formed by a discrete search space.

+ +

We then present two new kinds of program synthesizers which +target the two noise settings. First, we introduce a novel regularized +bitstream synthesizer that successfully generates programs even in +the presence of incorrect examples. We show that the synthesizer +can detect errors in the examples while combating overfitting – +a major problem in existing synthesis techniques. We also show +how the approach can be used in a setting where the dataset grows +dynamically via new examples (e.g., provided by a human).

+ +

Second, we present a novel technique for constructing statistical +code completion systems. These are systems trained on massive +datasets of open source programs, also known as “Big Code”. The +key idea is to introduce a domain specific language (DSL) over +trees and to learn functions in that DSL directly from the dataset. +These learned functions then condition the predictions made by the +system. This is a flexible and powerful technique which generalizes +several existing works as we no longer need to decide a priori on +what the prediction should be conditioned (another benefit is that +the learned functions are a natural mechanism for explaining the +prediction). As a result, our code completion system surpasses the +prediction capabilities of existing, hard-wired systems.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Model Editing Processes

+

Machel Reid, Graham Neubig. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + edit + +

+

Most existing sequence generation models produce outputs in one pass, usually left-to-right. However, this is in contrast with a more natural approach that humans use in generating content; iterative refinement and editing. Recent work has introduced edit-based models for various tasks (such as neural machine translation and text style transfer), but these generally model a single edit step. In this work, we propose modeling editing processes, modeling the whole process of iteratively generating sequences. We form a conceptual framework to describe the likelihood of multi-step edits, and describe neural models that can learn a generative model of sequences based on these multistep edits. We introduce baseline results and metrics on this task, finding that modeling editing processes improves performance on a variety of axes on both our proposed task and related downstream tasks compared to previous single-step models of edits.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

+

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, Shuai Ma. 2020

+

+ + [ArXiV] + + + +
+ + evaluation + +

+

Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

The Code2Text Challenge: Text Generation in Source Code Libraries

+

Kyle Richardson, Sina Zarrieß, Jonas Kuhn. INLG 2017

+

+ + [ArXiV] + + + +
+ + bimodal + +

+

We propose a new shared task for tactical data-to-text generation in the domain of source code libraries. Specifically, we focus on text generation of function descriptions from example software projects. Data is drawn from existing resources used for studying the related problem of semantic parser induction (Richardson and Kuhn, 2017b; Richardson and Kuhn, 2017a), and spans a wide variety of both natural languages and programming languages. In this paper, we describe these existing resources, which will serve as training and development data for the task, and discuss plans for building new independent test sets.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Function Assistant: A Tool for NL Querying of APIs

+

Kyle Richardson, Jonas Kuhn. EMNLP 2017

+

+ + [ArXiV] + + + +
+ + bimodal + + API + +

+

In this paper, we describe Function Assistant, a lightweight Python-based toolkit for querying and exploring source code repositories using natural language. The toolkit is designed to help end-users of a target API quickly find information about functions through high-level natural language queries and descriptions. For a given text query and background API, the tool finds candidate functions by performing a translation from the text to known representations in the API using the semantic parsing approach of Richardson and Kuhn (2017). Translations are automatically learned from example text-code pairs in example APIs. The toolkit includes features for building translation pipelines and query engines for arbitrary source code projects. To explore this last feature, we perform new experiments on 27 well-known Python projects hosted on Github.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Technical Correspondences in Technical Documentation

+

Kyle Richardson, Jonas Kuhn. ACL 2017

+

+ + [ArXiV] + + + +
+ + documentation + + API + + bimodal + +

+

We consider the problem of translating high-level textual descriptions to formal representations in technical documentation as part of an effort to model the meaning of such documentation. We focus specifically on the problem of learning translational correspondences between text descriptions and grounded representations in the target documentation, such as formal representation of functions or code templates. Our approach exploits the parallel nature of such documentation, or the tight coupling between high-level text and the low-level representations we aim to learn. Data is collected by mining technical documents for such parallel text-representation pairs, which we use to train a simple semantic parsing model. We report new baseline results on sixteen novel datasets, including the standard library documentation for nine popular programming languages across seven natural languages, and a small collection of Unix utility manuals.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Polyglot Semantic Parsing in APIs

+

Kyle Richardson, Jonathan Berant, Jonas Kuhn. NAACL 2018

+

+ + [ArXiV] + + + +
+ + bimodal + + API + +

+

Traditional approaches to semantic parsing (SP) work by training individual models for each available parallel dataset of text-meaning pairs. In this paper, we explore the idea of polyglot semantic translation, or learning semantic parsing models that are trained on multiple datasets and natural languages. In particular, we focus on translating text to code signature representations using the software component datasets of Richardson and Kuhn (2017a,b). The advantage of such models is that they can be used for parsing a wide variety of input natural languages and output programming languages, or mixed input languages, using a single unified model. To facilitate modeling of this type, we develop a novel graph-based decoding framework that achieves state-of-the-art performance on the above datasets, and apply this method to two other benchmark SP tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes

+

Cedric Richter, Heike Wehrheim. 2022

+

+ + [ArXiV] + + [Code] + + + +
+ + Transformer + + repair + + defect + +

+

Real bug fixes found in open source repositories seem to be the perfect source for learning to localize and repair real bugs. However, the absence of large scale bug fix collections has made it difficult to effectively exploit real bug fixes in the training of larger neural models in the past. In contrast, artificial bugs – produced by mutating existing source code – can be easily obtained at a sufficient scale and are therefore often preferred in the training of existing approaches. Still, localization and repair models that are trained on artificial bugs usually underperform when faced with real bugs. This raises the question whether bug localization and repair models trained on real bug fixes are more effective in localizing and repairing real bugs.

+ +

We address this question by introducing RealiT, a pre-train-and-fine-tune approach for effectively learning to localize and repair real bugs from real bug fixes. RealiT is first pre-trained on a large number of artificial bugs produced by traditional mutation operators and then fine-tuned on a smaller set of real bug fixes. Fine-tuning does not require any modifications of the learning algorithm and hence can be easily adopted in various training scenarios for bug localization or repair (even when real training data is scarce). In addition, we found that training on real bug fixes with RealiT is empirically powerful by nearly doubling the localization performance of an existing model on real bugs while maintaining or even improving the repair performance.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

+

Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, Guillaume Lample. 2021

+

+ + [ArXiV] + + + +
+ + pretraining + +

+

Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in unsupervised code translation, and 24% in natural language code search. Incidentally, we found that our pre-trained model is able to de-obfuscate fully obfuscated source files, and to suggest descriptive variable names.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Leveraging Automated Unit Tests for Unsupervised Code Translation

+

Baptiste Roziere, Jie M. Zhang, Francois Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample. 2021

+

+ + [ArXiV] + + + +
+ + migration + +

+

With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java → Python and Python → C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Automated Vulnerability Detection in Source Code Using Deep Representation Learning

+

Rebecca L. Russell, Louis Kim, Lei H. Hamilton, Tomo Lazovich, Jacob A. Harer, Onur Ozdemir, Paul M. Ellingwood, Marc W. McConley. 2018

+

+ + [ArXiV] + + + +
+ + program analysis + +

+

Increasing numbers of software vulnerabilities are discovered every year whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose serious risk of exploit and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully-selected findings from three different static analyzers that indicate potential exploits. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models

+

Iman Saberi, Fateme H. Fard. MSR 2023

+

+ + [ArXiV] + + + +
+ + Transformer + + repair + + summarization + +

+

Pre-trained Programming Language Models (PPLMs) achieved many recent states of the art results for many code-related software engineering tasks. Though some studies use data flow or propose tree-based models that utilize Abstract Syntax Tree (AST), most PPLMs do not fully utilize the rich syntactical information in source code. Still, the input is considered a sequence of tokens. There are two issues; the first is computational inefficiency due to the quadratic relationship between input length and attention complexity. Second, any syntactical information, when needed as an extra input to the current PPLMs, requires the model to be pre-trained from scratch, wasting all the computational resources already used for pre-training the current models. In this work, we propose Named Entity Recognition (NER) adapters, lightweight modules that can be inserted into Transformer blocks to learn type information extracted from the AST. These adapters can be used with current PPLMs such as CodeBERT, GraphCodeBERT, and CodeT5. We train the NER adapters using a novel Token Type Classification objective function (TTC). We insert our proposed work in CodeBERT, building CodeBERTER, and evaluate the performance on two tasks of code refinement and code summarization. CodeBERTER improves the accuracy of code refinement from 16.4 to 17.8 while using 20% of training parameter budget compared to the fully fine-tuning approach, and the BLEU score of code summarization from 14.75 to 15.90 while reducing 77% of training parameters compared to the fully fine-tuning approach.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Answer Semantic Queries over Code

+

Surya Prakash Sahu, Madhurima Mandal, Shikhar Bharadwaj, Aditya Kanade, Petros Maniatis, Shirish Shevade. 2022

+

+ + [ArXiV] + + + +
+ + static analysis + + Transformer + +

+

During software development, developers need answers to queries about semantic aspects of code. Even though extractive question-answering using neural approaches has been studied widely in natural languages, the problem of answering semantic queries over code using neural networks has not yet been explored. This is mainly because there is no existing dataset with extractive question and answer pairs over code involving complex concepts and long chains of reasoning. We bridge this gap by building a new, curated dataset called CodeQueries, and proposing a neural question-answering methodology over code. +We build upon state-of-the-art pre-trained models of code to predict answer and supporting-fact spans. Given a query and code, only some of the code may be relevant to answer the query. We first experiment under an ideal setting where only the relevant code is given to the model and show that our models do well. We then experiment under three pragmatic considerations: (1) scaling to large-size code, (2) learning from a limited number of examples and (3) robustness to minor syntax errors in code. Our results show that while a neural model can be resilient to minor syntax errors in code, increasing size of code, presence of code that is not relevant to the query, and reduced number of training examples limit the model performance. We are releasing our data and models to facilitate future work on the proposed problem of answering semantic queries over code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Oreo: detection of clones in the twilight zone

+

Vaibhav Saini, Farima Farmahinifarahani, Yadong Lu, Pierre Baldi, Cristina Lopes. ESEC/FSE 2018

+

+ + [ArXiV] + + [website] + + [code] + + + +
+ + clone + +

+

Source code clones are categorized into four types of increasing difficulty of detection, ranging from purely textual (Type-1) to purely semantic (Type-4). Most clone detectors reported in the literature work well up to Type-3, which accounts for syntactic differences. In between Type-3 and Type-4, however, there lies a spectrum of clones that, although still exhibiting some syntactic similarities, are extremely hard to detect – the Twilight Zone. Most clone detectors reported in the literature fail to operate in this zone. We present Oreo, a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately, but is also capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. We evaluate the recall of Oreo on BigCloneBench, and perform manual evaluation for precision. Oreo has both high recall and precision. More importantly, it pushes the boundary in detection of clones with moderate to weak syntactic similarity in a scalable manner.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Syntax and Sensibility: Using language models to detect and correct syntax errors

+

Eddie Antonio Santos, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, José Nelson Amaral. SANER 2018

+

+ + [PDF] + + [code] + + + +
+ + repair + + language model + +

+

Syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of experience that help them quickly resolve these frustrating errors. Standard LR parsers are of little help, typically resolving syntax errors and their precise location poorly. We propose a methodology that locates where syntax errors occur, and suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by using language models trained on correct source code to find tokens that seem out of place. Fixes are synthesized by consulting the language models to determine what tokens are more likely at the estimated error location. We compare n-gram and LSTM (long short-term memory) language models for this task, each trained on a large corpus of Java code collected from GitHub. Unlike prior work, our methodology does not rely that the problem source code comes from the same domain as the training data. We evaluated against a repository of real student mistakes. Our tools are able to find a syntactically-valid fix within its top-2 suggestions, often producing the exact fix that the student used to resolve the error. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated with incomprehensible syntax errors.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Products, Developers, and Milestones: How Should I Build My N-Gram Language Model

+

Juliana Saraiva, Christian Bird, Thomas Zimmermann. FSE 2015

+

+ + + +
+ + language model + +

+

Recent work has shown that although programming languages en- +able source code to be rich and complex, most code tends to be +repetitive and predictable. The use of natural language processing +(NLP) techniques applied to source code such as n-gram language +models show great promise in areas such as code completion, aiding impaired developers, and code search. In this paper, we address +three questions related to different methods of constructing lan- +guage models in an industrial context. Specifically, we ask: (1) Do +application specific, but smaller language models perform better +than language models across applications? (2) Are developer specific language models effective and do they differ depending on +what parts of the codebase a developer is working in? (3) Finally, +do language models change over time, i.e., does a language model +from early development model change later on in development? +The answers to these questions enable techniques that make use of +programming language models in development to choose the model +training corpus more effectively.

+ +

We evaluate these questions by building 28 language models across +developers, time periods, and applications within Microsoft Office +and present the results in this paper. We find that developer and +application specific language models perform better than models +from the entire codebase, but that temporality has little to no effect +on language model performance.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

What is it like to program with artificial intelligence?

+

Advait Sarkar, Andrew D. Gordon, Carina Negreanu, Christian Poelitz, Sruti Srinivasa Ragavan, Ben Zorn. 2022

+

+ + [ArXiV] + + + +
+ + human evaluation + + review + +

+

Large language models, such as OpenAI’s codex and Deepmind’s AlphaCode, can generate code to solve a variety of problems expressed in natural language. This technology has already been commercialised in at least one widely-used programming editor extension: GitHub Copilot.

+ +

In this paper, we explore how programming with large language models (LLM-assisted programming) is similar to, and differs from, prior conceptualisations of programmer assistance. We draw upon publicly available experience reports of LLM-assisted programming, as well as prior usability and design studies. We find that while LLM-assisted programming shares some properties of compilation, pair programming, and programming via search and reuse, there are fundamental differences both in the technical possibilities as well as the practical experience. Thus, LLM-assisted programming ought to be viewed as a new way of programming with its own distinct properties and challenges.

+ +

Finally, we draw upon observations from a user study in which non-expert end user programmers use LLM-assisted tools for solving data tasks in spreadsheets. We discuss the issues that might arise, and open research challenges, in applying large language models to end-user programming, particularly with users who have little or no programming expertise.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Inferring Javascript types using Graph Neural Networks

+

Jessica Schrouff, Kai Wohlfahrt, Bruno Marnette, Liam Atkinson. Representation Learning on Graphs and Manifolds ICLR 2019 workshop 2019

+

+ + [ArXiV] + + + +
+ + GNN + + types + + program analysis + +

+

The recent use of `Big Code’ with state-of-the-art deep learning methods offers promising avenues to ease program source code writing and correction. As a first step towards automatic code repair, we implemented a graph neural network model that predicts token types for Javascript programs. The predictions achieve an accuracy above 90%, which improves on previous similar work.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion

+

Roei Schuster, Congzheng Song, Eran Tromer, Vitaly Shmatikov. USENIX Security 2021

+

+ + [ArXiV] + + + +
+ + autocomplete + + adversarial + +

+

Code autocompletion is an integral feature of modern code editors and IDEs. The latest generation of autocompleters uses neural language models, trained on public open-source code repositories, to suggest likely (not just statically feasible) completions given the current context.

+ +

We demonstrate that neural code autocompleters are vulnerable to poisoning attacks. By adding a few specially-crafted files to the autocompleter’s training corpus (data poisoning), or else by directly fine-tuning the autocompleter on these files (model poisoning), the attacker can influence its suggestions for attacker-chosen contexts. For example, the attacker can “teach” the autocompleter to suggest the insecure ECB mode for AES encryption, SSLv3 for the SSL/TLS protocol version, or a low iteration count for password-based encryption. Moreover, we show that these attacks can be targeted: an autocompleter poisoned by a targeted attack is much more likely to suggest the insecure completion for files from a specific repo or specific developer.

+ +

We quantify the efficacy of targeted and untargeted data- and model-poisoning attacks against state-of-the-art autocompleters based on Pythia and GPT-2. We then evaluate existing defenses against poisoning attacks and show that they are largely ineffective.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

NIRMAL: Automatic Identification of Software Relevant Tweets Leveraging Language Model

+

Abhishek Sharma, Yuan Tian, David Lo. SANER 2015

+

+ + + +
+ + information extraction + +

+

Twitter is one of the most widely used social media +platforms today. It enables users to share and view short 140-character messages called “tweets”. About 284 million active +users generate close to 500 million tweets per day. Such rapid +generation of user generated content in large magnitudes results +in the problem of information overload. Users who are interested +in information related to a particular domain have limited means +to filter out irrelevant tweets and tend to get lost in the huge +amount of data they encounter. A recent study by Singer et +al. found that software developers use Twitter to stay aware of +industry trends, to learn from others, and to network with other +developers. However, Singer et al. also reported that developers +often find Twitter streams to contain too much noise which is a +barrier to the adoption of Twitter. In this paper, to help developers +cope with noise, we propose a novel approach named NIRMAL, +which automatically identifies software relevant tweets from a +collection or stream of tweets. Our approach is based on language +modeling which learns a statistical model based on a training +corpus (i.e., set of documents). We make use of a subset of posts +from StackOverflow, a programming question and answer site, as +a training corpus to learn a language model. A corpus of tweets +was then used to test the effectiveness of the trained language +model. The tweets were sorted based on the rank the model +assigned to each of the individual tweets. The top 200 tweets +were then manually analyzed to verify whether they are software +related or not, and then an accuracy score was calculated. The +results show that decent accuracy scores can be achieved by +various variants of NIRMAL, which indicates that NIRMAL can +effectively identify software related tweets from a huge corpus of +tweets.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On the Feasibility of Transfer-learning Code Smells using Deep Learning

+

Tushar Sharma, Vasiliki Efstathiou, Panos Louridas, Diomidis Spinellis. 2019

+

+ + [ArXiV] + + + +
+ + representation + + program analysis + +

+

Context: A substantial amount of work has been done to detect smells in source code using metrics-based and heuristics-based methods. Machine learning methods have been recently applied to detect source code smells; however, the current practices are considered far from mature.

+ +

Objective: First, explore the feasibility of applying deep learning models to detect smells without extensive feature engineering, just by feeding the source code in tokenized form. Second, investigate the possibility of applying transfer-learning in the context of deep learning models for smell detection.

+ +

Method: We use existing metric-based state-of-the-art methods for detecting three implementation smells and one design smell in C# code. Using these results as the annotated gold standard, we train smell detection models on three different deep learning architectures. These architectures use Convolution Neural Networks (CNNs) of one or two dimensions, or Recurrent Neural Networks (RNNs) as their principal hidden layers. For the first objective of our study, we perform training and evaluation on C# samples, whereas for the second objective, we train the models from C# code and evaluate the models over Java code samples. We perform the experiments with various combinations of hyper-parameters for each model.

+ +

Results: We find it feasible to detect smells using deep learning methods. Our comparative experiments find that there is no clearly superior method between CNN-1D and CNN-2D. We also observe that performance of the deep learning models is smell-specific. Our transfer-learning experiments show that transfer-learning is definitely feasible for implementation smells with performance comparable to that of direct-learning. This work opens up a new paradigm to detect code smells by transfer-learning especially for the programming languages where the comprehensive code smell detection tools are not available.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

An Exploratory Study on Code Attention in BERT

+

Rishab Sharma, Fuxiang Chen, Fatemeh H. Fard, David Lo. ICPC 2022

+

+ + [ArXiV] + + [code] + + + +
+ + Transformer + + representation + + language model + + interpretability + + pretraining + + clone + +

+

Many recent models in software engineering introduced deep neural models based on the Transformer architecture or use transformer-based Pre-trained Language Models (PLM) trained on code. Although these models achieve the state of the arts results in many downstream tasks such as code summarization and bug detection, they are based on Transformer and PLM, which are mainly studied in the Natural Language Processing (NLP) field. The current studies rely on the reasoning and practices from NLP for these models in code, despite the differences between natural languages and programming languages. There is also limited literature on explaining how code is modeled. Here, we investigate the attention behavior of PLM on code and compare it with natural language. We pre-trained BERT, a Transformer based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We run several experiments to analyze the attention values of code constructs on each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the most attended token [CLS] in NLP. This observation motivated us to leverage identifiers to represent the code sequence instead of the [CLS] token when used for code clone detection. Our results show that employing embeddings from identifiers increases the performance of BERT by 605% and 4% F1-score in its lower layers and the upper layers, respectively. When identifiers’ embeddings are used in CodeBERT, a code-based PLM, the performance is improved by 21–24% in the F1-score of clone detection. The findings can benefit the research community by using code-specific representations instead of applying the common embeddings used in NLP, and open new directions for developing smaller models with similar performance.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition

+

Rishab Sharma, Fuxiang Chen, Fatemeh H. Fard. ICPC 2022

+

+ + [ArXiV] + + [code] + + + +
+ + summarization + + documentation + + language model + + types + + representation + +

+

Code comment generation is the task of generating a high-level natural language description for a given code method/function. Although researchers have been studying multiple ways to generate code comments automatically, previous work mainly considers representing a code token in its entirety semantics form only (e.g., a language model is used to learn the semantics of a code token), and additional code properties such as the tree structure of a code are included as an auxiliary input to the model. There are two limitations: 1) Learning the code token in its entirety form may not be able to capture information succinctly in source code, and 2)The code token does not contain additional syntactic information, inherently important in programming languages. In this paper, we present LAnguage Model and Named Entity Recognition (LAMNER), a code comment generator capable of encoding code constructs effectively and capturing the structural property of a code token. A character-level language model is used to learn the semantic representation to encode a code token. For the structural property of a token, a Named Entity Recognition model is trained to learn the different types of code tokens. These representations are then fed into an encoder-decoder architecture to generate code comments. We evaluate the generated comments from LAMNER and other baselines on a popular Java dataset with four commonly used metrics. Our results show that LAMNER is effective and improves over the best baseline model in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr by 14.34%, 18.98%, 21.55%, 23.00%, 10.52%, 1.44%, and 25.86%, respectively. Additionally, we fused LAMNER’s code representation with the baseline models, and the fused models consistently showed improvement over the nonfused models. The human evaluation further shows that LAMNER produces high-quality code comments.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

NEUZZ: Efficient Fuzzing with Neural Program Smoothing

+

Dongdong She, Kexin Pei, Dave Epstein, Junfeng Yang, Baishakhi Ray, Suman Jana. IEEE S&P 2019

+

+ + [Code] + + + +
+ + fuzzing + +

+

Fuzzing has become the de facto standard technique for finding software vulnerabilities. However, even state-of-the-art fuzzers are not very efficient at finding hard-to-trigger software bugs. Most popular fuzzers use evolutionary guidance to generate inputs that can trigger different bugs. Such evolutionary algorithms, while fast and simple to implement, often get stuck in fruitless sequences of random mutations. Gradient-guided optimization presents a promising alternative to evolutionary guidance. Gradient-guided techniques have been shown to significantly outperform evolutionary algorithms at solving high-dimensional structured optimization problems in domains like machine learning by efficiently utilizing gradients or higher-order derivatives of the underlying function. However, gradient-guided approaches are not directly applicable to fuzzing as real-world program behaviors contain many discontinuities, plateaus, and ridges where the gradient-based methods often get stuck. We observe that this problem can be addressed by creating a smooth surrogate function approximating the discrete branching behavior of target program. In this paper, we propose a novel program smoothing technique using surrogate neural network models that can incrementally learn smooth approximations of a complex, real-world program’s branching behaviors. We further demonstrate that such neural network models can be used together with gradient-guided input generation schemes to significantly improve the fuzzing efficiency. Our extensive evaluations demonstrate that NEUZZ significantly outperforms 10 state-of-the-art graybox fuzzers on 10 real-world programs both at finding new bugs and achieving higher edge coverage. NEUZZ found 31 unknown bugs that other fuzzers failed to find in 10 real world programs and achieved 3X more edge coverage than all of the tested graybox fuzzers for 24 hours running.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Execution through Neural Code Fusion

+

Zhan Shi, Kevin Swersky, Daniel Tarlow, Parthasarathy Ranganathan, Milad Hashemi. 2019

+

+ + [ArXiV] + + + +
+ + representation + +

+

As the performance of computer systems stagnates due to the end of Moore’s Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn representations of source code, these representations do not understand how code dynamically executes. In this work, we propose a new approach to use GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and complex data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related task (algorithm classification).

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CV4Code: Sourcecode Understanding via Visual Code Representations

+

Ruibo Shi, Lili Tao, Rohan Saphal, Fran Silavong, Sean J. Moran. 2022

+

+ + [ArXiV] + + + +
+ + code similarity + + Transformer + +

+

We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that facilitates fast generation of sourcecode images and eliminates redundancy in the encoding that would arise from an RGB pixel representation. Furthermore, as sourcecode is treated as images, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to any particular programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code which is not possible from methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by learning Convolutional and Transformer networks to predict the functional task, i.e. the problem it solves, of the source code directly from its two-dimensional representation, and using an embedding from its latent space to derive a similarity score of two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time we show the benefits of treating sourcecode understanding as a form of image processing task.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Automatic Source Code Summarization with Extended Tree-LSTM

+

Yusuke Shido, Yasuaki Kobayashi, Akihiro Yamamoto, Atsushi Miyamoto, Tadayuki Matsumura. International Joint Conference on Neural Networks 2019

+

+ + [ArXiV] + + [Dataset] + + [code] + + + +
+ + summarization + + grammar + +

+

Neural machine translation models are used to automatically generate a document from given source code since this can be regarded as a machine translation task. Source code summarization is one of the components for automatic document generation, which generates a summary in natural language from given source code. This suggests that techniques used in neural machine translation, such as Long Short-Term Memory (LSTM), can be used for source code summarization. However, there is a considerable difference between source code and natural language: Source code is essentially structured, having loops and conditional branching, etc. Therefore, there is some obstacle to apply known machine translation models to source code.Abstract syntax trees (ASTs) capture these structural properties and play an important role in recent machine learning studies on source code. Tree-LSTM is proposed as a generalization of LSTMs for tree-structured data. However, there is a critical issue when applying it to ASTs: It cannot handle a tree that contains nodes having an arbitrary number of children and their order simultaneously, which ASTs generally have such nodes. To address this issue, we propose an extension of Tree-LSTM, which we call Multi-way Tree-LSTM and apply it for source code summarization. As a result of computational experiments, our proposal achieved better results when compared with several state-of-the-art techniques.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Evaluation of Type Inference with Textual Cues

+

Amirreza A. Shirani, A. Pastor Lopez-Monroy, Fabio Gonzalez, Thamar Solorio, Mohammad Amin Alipour. NLSE 2018

+

+ + [PDF] + + + +
+ + information extraction + +

+

Type information plays an important role in the success of information retrieval and recommendation systems in software +engineering. Thus, the absence of types in dynamically-typed +languages poses a challenge to adapt these systems to support +dynamic languages.

+ +

In this paper, we explore the viability of type inference using +textual cues. That is, we formulate the type inference problem as a classification problem which uses the textual features +in the source code to predict the type of variables. In this +approach, a classifier learns a model to distinguish between +types of variables in a program. The model is subsequently +used to (approximately) infer the types of other variables.

+ +

We evaluate the feasibility of this approach on four Java +projects wherein type information is already available in the +source code and can be used to train and test a classifier. Our +experiments show this approach can predict the type of new +variables with relatively high accuracy (80% F-measure). +These results suggest that textual cues can be +complementary +tools in inferring types for dynamic languages.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On-the-Fly Adaptation of Source Code Models using Meta-Learning

+

Disha Shrivastava, Hugo Larochelle, Daniel Tarlow. 2020

+

+ + [ArXiV] + + [Code] + + + +
+ + language model + + autocomplete + +

+

The ability to adapt to unseen, local contexts is an important challenge that successful models of source code must overcome. One of the most popular approaches for the adaptation of such models is dynamic evaluation. With dynamic evaluation, when running a model on an unseen file, the model is updated immediately after having observed each token in that file. In this work, we propose instead to frame the problem of context adaptation as a meta-learning problem. We aim to train a base source code model that is best able to learn from information in a file to deliver improved predictions of missing tokens. Unlike dynamic evaluation, this formulation allows us to select more targeted information (support tokens) for adaptation, that is both before and after a target hole in a file. We consider an evaluation setting that we call line-level maintenance, designed to reflect the downstream task of code auto-completion in an IDE. Leveraging recent developments in meta-learning such as first-order MAML and Reptile, we demonstrate improved performance in experiments on a large scale Java GitHub corpus, compared to other adaptation baselines including dynamic evaluation. Moreover, our analysis shows that, compared to a non-adaptive baseline, our approach improves performance on identifiers and literals by 44% and 15%, respectively.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Repository-Level Prompt Generation for Large Language Models of Code

+

Disha Shrivastava, Hugo Larochelle, Daniel Tarlow. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + code completion + +

+

With the success of large language models (LLMs) of code and their use as code assistants (e.g. Codex used in GitHub Copilot), techniques for introducing domain-specific knowledge in the prompt design process become important. In this work, we propose a framework called Repo-Level Prompt Generator that learns to generate example-specific prompts using a set of rules. These rules take context from the entire repository, thereby incorporating both the structure of the repository and the context from other relevant files (e.g. imports, parent class files). Our technique doesn’t require any access to the weights of the LLM, making it applicable in cases where we only have black-box access to the LLM. We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives. We demonstrate that an oracle constructed from our proposed rules gives up to 36% relative improvement over Codex, showing the quality of the rules. Further, we show that when we train a model to select the best rule, we can achieve significant performance gains over Codex. The code for our work can be found at: https://github.com/shrivastavadisha/repo_level_prompt_generation .

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

RepoFusion: Training Code Models to Understand Your Repository

+

Disha Shrivastava, Denis Kocetkov, Harm de Vries, Dzmitry Bahdanau, Torsten Scholak. 2023

+

+ + [ArXiV] + + + +
+ + completion + +

+

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models as CodeGen-16B-multi ($\sim73\times$ larger) and closely match the performance of the $\sim 70\times$ larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at \url{https://huggingface.co/RepoFusion}.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Improving Code Search with Co-Attentive Representation Learning

+

Jianhang Shuai, Ling Xu, Chao Liu, Meng Yan, Xin Xia, Yan Lei. ICPC 2020

+

+ + [ACM] + + + +
+ + search + +

+

Searching and reusing existing code from a large-scale codebase, e.g, GitHub, can help developers complete a programming task efficiently. Recently, Gu et al. proposed a deep learning-based model (i.e., DeepCS), which significantly outperformed prior models. The DeepCS embedded codebase and natural language queries into vectors by two LSTM (long and short-term memory) models separately, and returned developers the code with higher similarity to a code search query. However, such embedding method learned two isolated representations for code and query but ignored their internal semantic correlations. As a result, the learned isolated representations of code and query may limit the effectiveness of code search.

+ +

To address the aforementioned issue, we propose a co-attentive representation learning model, i.e., Co-Attentive Representation Learning Code Search-CNN (CARLCS-CNN). CARLCS-CNN learns interdependent representations for the embedded code and query with a co-attention mechanism. Generally, such mechanism learns a correlation matrix between embedded code and query, and co-attends their semantic relationship via row/column-wise max-pooling. In this way, the semantic correlation between code and query can directly affect their individual representations. We evaluate the effectiveness of CARLCS-CNN on Gu et al.’s dataset with 10k queries. Experimental results show that the proposed CARLCS-CNN model significantly outperforms DeepCS by 26.72% in terms of MRR (mean reciprocal rank). Additionally, CARLCS-CNN is five times faster than DeepCS in model training and four times in testing.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Loop Invariants for Program Verification

+

Xujie Si, Hanjun Dai, Mukund Raghothaman, Mayur Naik, Le Song. NeurIPS 2018

+

+ + [Preprint] + + + +
+ + program analysis + + verification + +

+

A fundamental problem in program verification concerns inferring loop invariants. +The problem is undecidable and even practical instances are challenging. Inspired +by how human experts construct loop invariants, we propose a reasoning framework +CODE2INV +that constructs the solution by multi-step decision making and querying +an external program graph memory block. By training with reinforcement learning, +CODE2INV +captures rich program features and avoids the need for ground truth +solutions as supervision. Compared to previous learning tasks in domains with +graph-structured data, it addresses unique challenges, such as a binary objective +function and an extremely sparse reward that is given by an automated theorem +prover only after the complete loop invariant is proposed. We evaluate +CODE2INV on +a suite of 133 benchmark problems and compare it to three state-of-the-art systems. +It solves 106 problems compared to 73 by a stochastic search-based system, 77 by +a heuristic search-based system, and 100 by a decision tree learning-based system. +Moreover, the strategy learned can be generalized to new programs: compared to +solving new instances from scratch, the pre-trained agent is more sample efficient +in finding solutions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Senatus - A Fast and Accurate Code-to-Code Recommendation Engine

+

Fran Silavong, Sean Moran, Antonios Georgiadis, Rohan Saphal, Robert Otter. MSR 2022

+

+ + [ArXiV] + + + +
+ + code similarity + + search + +

+

Machine learning on source code (MLOnCode) is a popular research field that has been driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse and concise code snippets that usefully extend the code currently being written by a developer in their development environment (IDE). Code-to-code recommendation engines hold the promise of increasing developer productivity by reducing context switching from the IDE and increasing code-reuse. Existing code-to-code recommendation engines do not scale gracefully to large codebases, exhibiting a linear growth in query time as the code repository increases in size. In addition, existing code-to-code recommendation engines fail to account for the global statistics of code repositories in the ranking function, such as the distribution of code snippet lengths, leading to sub-optimal retrieval results. We address both of these weaknesses with Senatus, a new code-to-code recommendation engine. At the core of Senatus is De-Skew LSH a new locality sensitive hashing (LSH) algorithm that indexes the data for fast (sub-linear time) retrieval while also counteracting the skewness in the snippet length distribution using novel abstract syntax tree-based feature scoring and selection algorithms. We evaluate Senatus and find the recommendations to be of higher quality than competing baselines, while achieving faster search. For example on the CodeSearchNet dataset Senatus improves performance by 31.21% F1 and 147.9x faster query time compared to Facebook Aroma. Senatus also outperforms standard MinHash LSH by 29.2% F1 and 51.02x faster query time.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair

+

André Silva, Sen Fang, Martin Monperrus. 2023

+

+ + [ArXiV] + + + +
+ + repair + +

+

Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tunes LLMs with naive code representations and is fundamentally limited in its ability to fine-tune larger LLMs. To address this problem, we propose RepairLLaMA, a novel program repair approach that combines 1) code representations for APR and 2) the state-of-the-art parameter-efficient LLM fine-tuning technique called LoRA. This results in RepairLLaMA producing a highly effective `program repair adapter’ for fixing bugs with language models. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals. Second, parameter-efficient fine-tuning helps fine-tuning to converge and contributes to the effectiveness of the repair adapter to fix data-points outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 125 Defects4J v2 and 82 HumanEval-Java bugs, outperforming all baselines.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Question Independent Grading using Machine Learning: The Case of Computer Program Grading

+

Gursimran Singh, Shashank Srikant, Varun Aggarwal. KDD 2016

+

+ + [PDF] + + [website] + + + +
+ + education + +

+

Learning supervised models to grade open-ended responses is an expensive process. A model has to be trained for every prompt/question separately, which in turn requires graded samples. In automatic programming evaluation specifically, the focus of this work, this issue is amplified. The models have to be trained not only for every question but also for every language the question is offered in. Moreover, the availability and time taken by experts to create a labeled set of programs for each question is a major bottleneck in scaling such a system. We address this issue by presenting a method to grade computer programs which requires no manually assigned labeled samples for grading responses to a new, unseen question. We extend our previous work (by Srikant, Aggarwal; KDD 2014) wherein we introduced a grammar of features to learn question specific models. In this work, we propose a method to transform those features into a set of features that maintain their structural relation with the labels across questions. Using these features we learn one supervised model, across questions for a given language, which can then be applied to an ungraded response to an unseen question. We show that our method rivals the performance of both, question specific models and the consensus among human experts while substantially outperforming extant ways of evaluating codes. We demonstrate the system single s value by deploying it to grade programs in a high stakes assessment. The learning from this work is transferable to other grading tasks such as math question grading and also provides a new variation to the supervised learning approach.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CORE: Automating Review Recommendation for Code Changes

+

JingKai Siow, Cuiyun Gao, Lingling Fan, Sen Chen, Yang Liu. SANER 2019

+

+ + [ArXiV] + + + +
+ + review + +

+

Code review is a common process that is used by developers, in which a reviewer provides useful comments or points out defects in the submitted source code changes via pull request. Code review has been widely used for both industry and open-source projects due to its capacity in early defect identification, project maintenance, and code improvement. With rapid updates on project developments, code review becomes a non-trivial and labor-intensive task for reviewers. Thus, an automated code review engine can be beneficial and useful for project development in practice. Although there exist prior studies on automating the code review process by adopting static analysis tools or deep learning techniques, they often require external sources such as partial or full source code for accurate review suggestion. In this paper, we aim at automating the code review process only based on code changes and the corresponding reviews but with better performance. The hinge of accurate code review suggestion is to learn good representations for both code changes and reviews. To achieve this with limited source, we design a multi-level embedding (i.e., word embedding and character embedding) approach to represent the semantics provided by code changes and reviews. The embeddings are then well trained through a proposed attentional deep learning model, as a whole named CORE. We evaluate the effectiveness of CORE on code changes and reviews collected from 19 popular Java projects hosted on Github. Experimental results show that our model CORE can achieve significantly better performance than the state-of-the-art model (DeepMem), with an increase of 131.03% in terms of Recall@10 and 150.69% in terms of Mean Reciprocal Rank. Qualitative general word analysis among project developers also demonstrates the performance of CORE in automating code review.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Program Semantics with Code Representations: An Empirical Study

+

Jing Kai Siow, Shangqing Liu, Xiaofei Xie, Guozhu Meng, Yang Liu. SANER 2022

+

+ + [ArXiV] + + + +
+ + representation + +

+

Program semantics learning is the core and fundamental for various code intelligent tasks e.g., vulnerability detection, clone detection. A considerable amount of existing works propose diverse approaches to learn the program semantics for different tasks and these works have achieved state-of-the-art performance. However, currently, a comprehensive and systematic study on evaluating different program representation techniques across diverse tasks is still missed.

+ +

From this starting point, in this paper, we conduct an empirical study to evaluate different program representation techniques. Specifically, we categorize current mainstream code representation techniques into four categories i.e., Feature-based, Sequence-based, Tree-based, and Graph-based program representation technique and evaluate its performance on three diverse and popular code intelligent tasks i.e., {Code Classification}, Vulnerability Detection, and Clone Detection on the public released benchmark. We further design three {research questions (RQs)} and conduct a comprehensive analysis to investigate the performance. By the extensive experimental results, we conclude that (1) The graph-based representation is superior to the other selected techniques across these tasks. (2) Compared with the node type information used in tree-based and graph-based representations, the node textual information is more critical to learning the program semantics. (3) Different tasks require the task-specific semantics to achieve their highest performance, however combining various program semantics from different dimensions such as control dependency, data dependency can still produce promising results.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Mining Idioms in the Wild

+

Aishwarya Sivaraman, Rui Abreu, Andrew Scott, Tobi Akomolede, Satish Chandra. 2021

+

+ + [ArXiV] + + + +
+ + pattern mining + + refactoring + +

+

Existing code repositories contain numerous instances of code patterns that are idiomatic ways of accomplishing a particular programming task. Sometimes, the programming language in use supports specific operators or APIs that can express the same idiomatic imperative code much more succinctly. However, those code patterns linger in repositories because the developers may be unaware of the new APIs or have not gotten around to them. Detection of idiomatic code can also point to the need for new APIs.

+ +

We share our experiences in mine idiomatic patterns from the Hack repo at Facebook. We found that existing techniques either cannot identify meaningful patterns from syntax trees or require test-suite-based dynamic analysis to incorporate semantic properties to mine useful patterns. The key insight of the approach proposed in this paper – Jezero – is that semantic idioms from a large codebase can be learned from canonicalized dataflow trees. We propose a scalable, lightweight static analysis-based approach to construct such a tree that is well suited to mine semantic idioms using nonparametric Bayesian methods.

+ +

Our experiments with Jezero on Hack code shows a clear advantage of adding canonicalized dataflow information to ASTs: Jezero was significantly more effective than a baseline that did not have the dataflow augmentation in being able to effectively find refactoring opportunities from unannotated legacy code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

LExecutor: Learning-Guided Execution

+

Beatriz Souza, Michael Pradel. 2023

+

+ + [ArXiV] + + [Code] + + + +
+ + execution + +

+

Executing code is essential for various program analysis tasks, e.g., to detect bugs that manifest through exceptions or to obtain execution traces for further dynamic analysis. However, executing an arbitrary piece of code is often difficult in practice, e.g., because of missing variable definitions, missing user inputs, and missing third-party dependencies. This paper presents LExecutor, a learning-guided approach for executing arbitrary code snippets in an underconstrained way. The key idea is to let a neural model predict missing values that otherwise would cause the program to get stuck, and to inject these values into the execution. For example, LExecutor injects likely values for otherwise undefined variables and likely return values of calls to otherwise missing functions. We evaluate the approach on Python code from popular open-source projects and on code snippets extracted from Stack Overflow. The neural model predicts realistic values with an accuracy between 80.1% and 94.2%, allowing LExecutor to closely mimic real executions. As a result, the approach successfully executes significantly more code than any available technique, such as simply executing the code as-is. For example, executing the open-source code snippets as-is covers only 4.1% of all lines, because the code crashes early on, whereas LExecutor achieves a coverage of 50.1%.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code

+

Egor Spirin, Egor Bogomolov, Vladimir Kovalenko, Timofey Bryksin. MSR 2021

+

+ + [ArXiV] + + [website] + + [code] + + + +
+ + tool + +

+

The application of machine learning algorithms to source code has grown in the past years. Since these algorithms are quite sensitive to input data, it is not surprising that researchers experiment with input representations. Nowadays, a popular starting point to represent code is abstract syntax trees (ASTs). Abstract syntax trees have been used for a long time in various software engineering domains, and in particular in IDEs. The API of modern IDEs allows to manipulate and traverse ASTs, resolve references between code elements, etc. Such algorithms can enrich ASTs with new data and therefore may be useful in ML-based code analysis. In this work, we present PSIMINER— a tool for processing PSI trees from the IntelliJ Platform. PSI trees contain code syntax trees as well as functions to work with them, and therefore can be used to enrich code representation using static analysis algorithms of modern IDEs. To showcase this idea, we use our tool to infer types of identifiers in Java ASTs and extend the code2seq model for the method name prediction problem.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A system to grade computer programming skills using machine learning

+

Shashank Srikant, Varun Aggarwal. KDD 2014

+

+ + [PDF] + + [website] + + + +
+ + education + +

+

The automatic evaluation of computer programs is a nascent area of research with a potential for large-scale impact. Extant program assessment systems score mostly based on the number of test-cases passed, providing no insight into the competency of the programmer. In this paper, we present a system to grade computer programs automatically. In addition to grading a program on its programming practices and complexity, the key kernel of the system is a machine-learning based algorithm which determines closeness of the logic of the given program to a correct program. This algorithm uses a set of highly-informative features, derived from the abstract representations of a given program, that capture the program’s functionality. These features are then used to learn a model to grade the programs, which are built against evaluations done by experts. We show that the regression models provide much better grading than the ubiquitous test-case-pass based grading and rivals the grading accuracy of other open-response problems such as essay grading . We also show that our novel features add significant value over and above basic keyword/expression count features. In addition to this, we propose a novel way of posing computer-program grading as a one-class modeling problem and report encouraging preliminary results. We show the value of the system through a case study in a real-world industrial deployment. To the best of the authors’ knowledge, this is the first time a system using machine learning has been developed and used for grading programs. The work is timely with regard to the recent boom in Massively Online Open Courseware (MOOCs), which promises to produce a significant amount of hand-graded digitized data.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Grammar-Based Structural CNN Decoder for Code Generation

+

Zeyu Sun, Qihao Zhu, Lili Mou, Yingfei Xiong, Ge Li, Lu Zhang. AAAI 2019

+

+ + + +
+ + code generation + + grammar + +

+

Code generation maps a program description to executable +source code in a programming language. Existing approaches +mainly rely on a recurrent neural network (RNN) as the decoder. However, we find that a program contains significantly +more tokens than a natural language sentence, and thus it may +be inappropriate for RNN to capture such a long sequence. In +this paper, we propose a grammar-based structural convolutional neural network (CNN) for code generation. Our model +generates a program by predicting the grammar rules of the +programming language; we design several CNN modules, including the tree-based convolution and pre-order convolution, +whose information is further aggregated by dedicated attentive pooling layers. Experimental results on the HearthStone +benchmark dataset show that our CNN code generator significantly outperforms the previous state-of-the-art method by 5 +percentage points; additional experiments on several semantic parsing tasks demonstrate the robustness of our model. We +also conduct in-depth ablation test to better understand each +component of our model.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

PSCS: A Path-based Neural Model for Semantic Code Search

+

Zhensu Sun, Yan Liu, Chen Yang, Yu Qian. 2020

+

+ + [ArXiV] + + + +
+ + grammar + + search + +

+

To obtain code snippets for reuse, programmers prefer to search for related documents, e.g., blogs or Q&A, instead of code itself. The major reason is due to the semantic diversity and mismatch between queries and code snippets. Deep learning models have been proposed to address this challenge. Compared with approaches using information retrieval techniques, deep learning models do not suffer from the information loss caused by refining user intention into keywords. However, the performance of previous works is not satisfactory because they ignore the importance of code structure. When the semantics of code (e.g., identifier names, APIs) are ambiguous, code structure may be the only feature for the model to utilize. In that case, previous works relearn the structural information from lexical tokens of code, which is extremely difficult for a model without any domain knowledge. In this work, we propose PSCS, a path-based neural model for semantic code search. Our model encodes both the semantics and structures of code represented by AST paths. We train and evaluate our model over 330k-19k query-function pairs, respectively. The evaluation results demonstrate that PSCS achieves a SuccessRate of 47.6% and a Mean Reciprocal Rank (MRR) of 30.4% when considering the top-10 results with a match. The proposed approach significantly outperforms both DeepCS, the first approach that applies deep learning to code search task, and CARLCS, a state-of-the-art approach that introduces a co-attentive representation learning model on the basis of DeepCS. The importance of code structure is demonstrated with an ablation study on code features, which enlightens model design for further studies.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Pythia: AI-assisted Code Completion System

+

Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, Neel Sundaresan. KDD 2019

+

+ + + +
+ + autocomplete + + language model + +

+

In this paper, we propose a novel end-to-end approach for AI-assisted code completion called Pythia. It generates ranked lists of method and API recommendations which can be used by software developers at edit time. The system is currently deployed as part of Intellicode extension in Visual Studio Code IDE. Pythia exploits state-of-the-art large-scale deep learning models trained on code contexts extracted from abstract syntax trees. It is designed to work at a high throughput predicting the best matching code completions on the order of 100 ms.

+ +

We describe the architecture of the system, perform comparisons to frequency-based approach and invocation-based Markov Chain language model, and discuss challenges serving Pythia models on lightweight client devices.

+ +

The offline evaluation results obtained on 2700 Python open source software GitHub repositories show a top-5 accuracy of 92%, surpassing the baseline models by 20% averaged over classes, for both intra and cross-project settings.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Fast and Memory-Efficient Neural Code Completion

+

Alexey Svyatkovskiy, Sebastian Lee, Anna Hadjitofi, Maik Riechert, Juliana Franco, Miltiadis Allamanis. 2020

+

+ + [ArXiV] + + + +
+ + autocomplete + +

+

Code completion is one of the most widely used features of modern integrated development environments (IDEs). Deep learning has recently made significant progress in the statistical prediction of source code. However, state-of-the-art neural network models consume prohibitively large amounts of memory, causing computational burden to the development environment, especially when deployed in lightweight client devices.

+ +

In this work, we reframe neural code completion from a generation task to a task of learning to rank the valid completion suggestions computed from static analyses. By doing so, we are able to design and test a variety of deep neural network model configurations. One of our best models consumes 6 MB of RAM, computes a single suggestion in 8 ms, and achieves 90% recall in its top five suggestions. Our models outperform standard language modeling code completion techniques in terms of predictive performance, computational speed, and memory efficiency. Furthermore, they learn about code semantics from the natural language aspects of the code (e.g. identifier names) and can generalize better to previously unseen code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

IntelliCode Compose: Code Generation Using Transformer

+

Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, Neel Sundaresan. 2020

+

+ + [ArXiV] + + + +
+ + autocomplete + + code generation + + synthesis + + language model + + pretraining + +

+

In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments. +In this paper, we introduce IntelliCode Compose − a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook. +Our best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for Python programming language.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code Translation with Compiler Representations

+

Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + migration + + decompilation + +

+

In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low quality translation, reducing the practicality of NMT, and stressing the need for an approach significantly increasing its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java - Rust pair. We extend previous test sets for code translation, by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as intermediary pivot for translation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code and Named Entity Recognition in StackOverflow

+

Jeniya Tabassum, Mounica Maddela, Wei Xu, Alan Ritter. ACL 2020

+

+ + [ArXiV] + + [Code] + + + +
+ + dataset + + information extraction + +

+

There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which lead to an absolute increase of +10 F-1 score over off-the-shelf BERT. We also present the SoftNER model which achieves an overall 79.10 F1 score for code and named entity recognition on StackOverflow data. Our SoftNER model incorporates a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

LLM4Decompile: Decompiling Binary Code with Large Language Models

+

Hanzhuo Tan, Qi Luo, Jing Li, Yuqun Zhang. 2024

+

+ + [ArXiV] + + [code] + + + +
+ + decompilation + + translation + + evaluation + + large language models + + LLM + +

+

Decompilation aims to restore compiled code to human-readable source code, but struggles with details like names and structure. Large language models (LLMs) show promise for programming tasks, motivating their application to decompilation. However, there does not exist any open-source LLM for decompilation. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs ranging from 1B to 33B pre-trained on 4 billion tokens of C source code and the corresponding assembly code. The open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating the decompilation model from the perspective of program semantics. Experiments indicate that our LLM4Decompile has demonstrated the capability to accurately decompile 21% of the assembly code, which achieves a 50% improvement over GPT-4. Our code, dataset, and models are released at this https URL

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Fix Build Errors with Graph2Diff Neural Networks

+

Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton, Edward Aftandilian. 2019

+

+ + [ArXiV] + + [preprint] + + + +
+ + edit + + repair + +

+

Professional software developers spend a significant amount oftime fixing builds, but this has received little attention as a prob-lem in automatic program repair. We present a new deep learningarchitecture, called Graph2Diff, for automatically localizing andfixing build errors. We represent source code, build configurationfiles, and compiler diagnostic messages as a graph, and then use aGraph Neural Network model to predict a diff. A diff specifies howto modify the code’s abstract syntax tree, represented in the neuralnetwork as a sequence of tokens and of pointers to code locations.Our network is an instance of a more general abstraction which wecall Graph2Tocopo, which is potentially useful in any developmenttool for predicting source code changes. We evaluate the model ona dataset of over 500k real build errors and their resolutions fromprofessional developers. Compared to the approach of DeepDelta, our approach tackles the harder task of predicting a moreprecise diff but still achieves over double the accuracy.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Import2vec - Learning Embeddings for Software Libraries

+

Bart Theeten, Frederik Vandeputte, Tom Van Cutsem. MSR 2019

+

+ + + +
+ + representation + +

+

We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning tasks (e.g. classification) or applications such as contextual search and analogical reasoning.

+ +

We apply word embedding techniques from natural language processing (NLP) to train embeddings for library packages (“library vectors”). Library vectors represent libraries by similar context of use as determined by import statements present in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveals that library vectors capture semantically meaningful relationships among software libraries, such as the relationship between frameworks and their plug-ins and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript) and data science toolkits (in Python).

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair

+

Haoye Tian, Kui Liu, Abdoul Kader Kaboreé, Anil Koyuncu, Li Li, Jacques Klein, Tegawendé F. Bissyandé. 2020

+

+ + [ArXiV] + + + +
+ + repair + + Transformer + +

+

A large body of the literature of automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explore research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations to learn deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1000 labeled patches. Our study shows that learned representations can lead to reasonable performance when comparing against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DebugBench: Evaluating Debugging Capability of Large Language Models

+

Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Zhiyuan Liu, Maosong Sun. 2024

+

+ + [ArXiV] + + + +
+ + repair + +

+

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs’ debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce `DebugBench’, an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and three open-source models in a zero-shot scenario. We find that (1) while closed-source models like GPT-4 exhibit inferior debugging performance compared to humans, open-source models such as Code Llama fail to attain any pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Simulating Execution Time of Tensor Programs using Graph Neural Networks

+

Jakub M. Tomczak, Romain Lepert, Auke Wiggers. Representation Learning on Graphs and Manifolds at ICLR 2019

+

+ + [ArXiV] + + + +
+ + GNN + +

+

Optimizing the execution time of tensor program, e.g., a convolution, involves finding its optimal configuration. Searching the configuration space exhaustively is typically infeasible in practice. In line with recent research using TVM, we propose to learn a surrogate model to overcome this issue. The model is trained on an acyclic graph called an abstract syntax tree, and utilizes a graph convolutional network to exploit structure in the graph. We claim that a learnable graph-based data processing is a strong competitor to heuristic-based feature extraction. We present a new dataset of graphs corresponding to configurations and their execution time for various tensor programs. We provide baselines for a runtime prediction task.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Recovering Variable Names for Minified Code with Usage Contexts

+

Hieu Tran, Ngoc Tran, Son Nguyen, Hoan Nguyen, Tien N. Nguyen. ICSE 2019

+

+ + + +
+ + naming + + deobfuscation + +

+

In modern Web technology, JavaScript (JS) code plays an important role. To avoid the exposure of original source code, the variable names in JS code deployed in the wild are often replaced by short, meaningless names, thus making the code extremely difficult to manually understand and analysis. This paper presents JSNeat, an information retrieval (IR)-based approach to recover the variable names in minified JS code. JSNeat follows a data-driven approach to recover names by searching for them in a large corpus of open-source JS code. We use three types of contexts to match a variable in given minified code against the corpus including the context of properties and roles of the variable, the context of that variable and relations with other variables under recovery, and the context of the task of the function to which the variable contributes. We performed several empirical experiments to evaluate JSNeat on the dataset of more than 322K JS files with 1M functions, and 3.5M variables with 176K unique variable names. We found that JSNeat achieves a high accuracy of 69.1%, which is the relative improvements of 66.1% and 43% over two state-of-the-art approaches JSNice and JSNaughty, respectively. The time to recover for a file or for a variable with JSNeat is twice as fast as with JSNice and 4x as fast as with JNaughty, respectively.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On the Localness of Software

+

Zhaopeng Tu, Zhendong Su, Premkumar Devanbu. FSE 2014

+

+ + + +
+ + language model + +

+

The n-gram language model, which has its roots in statistical natural +language processing, has been shown to successfully capture the +repetitive and predictable regularities (“naturalness”) of source code, +and help with tasks such as code suggestion, porting, and designing +assistive coding devices. However, we show in this paper that this +natural-language-based model fails to exploit a special property of +source code: localness. We find that human-written programs are +localized: they have useful local regularities that can be captured +and exploited. We introduce a novel cache language model that +consists of both an n-gram and an added “cache” component to +exploit localness. We show empirically that the additional cache +component greatly improves the n-gram approach by capturing +the localness of software, as measured by both cross-entropy and +suggestion accuracy. Our model’s suggestion accuracy is actually +comparable to a state-of-the-art, semantically augmented language +model; but it is simpler and easier to implement. Our cache language +model requires nothing beyond lexicalization, and thus is applicable +to all programming languages.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Learning Similarities from Different Representations of Source Code

+

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk. MSR 2018

+

+ + + +
+ + representation + + clone + +

+

Assessing the similarity between code components plays a pivotal +role in a number of Software Engineering (SE) tasks, such as clone +detection, impact analysis, refactoring, etc. +Code similarity is generally measured by relying on manually defined or hand-crafted +features, e.g., by analyzing the overlap among identifiers or comparing the Abstract Syntax Trees of two code components. These +features represent a best guess at what SE researchers can utilize to +exploit and reliably assess code similarity for a given task. Recent +work has shown, when using a stream of identifiers to represent +the code, that Deep Learning (DL) can effectively replace manual +feature engineering for the task of clone detection. However, source +code can be represented at different levels of abstraction: identifiers, Abstract Syntax Trees, Control Flow Graphs, and Bytecode. +We conjecture that each code representation can provide a different, +yet orthogonal view of the same code fragment, thus, enabling a +more reliable detection of similarities in code. In this paper, we +demonstrate how SE tasks can benefit from a DL-based approach, +which can automatically learn code similarities from different representations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation

+

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk. 2018

+

+ + + +
+ + repair + +

+

Millions of open-source projects with numerous bug fixes are available in code repositories. This proliferation of software development histories can be leveraged to learn how to fix common programming bugs. To explore such a potential, we perform an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. First, we mine millions of bug-fixes from the change histories of projects hosted on GitHub, in order to extract meaningful examples of such bug-fixes. Next, we abstract the buggy and corresponding fixed code, and use them to train an Encoder-Decoder model able to translate buggy code into its fixed version. In our empirical investigation we found that such a model is able to fix thousands of unique buggy methods in the wild. Overall, this model is capable of predicting fixed patches generated by developers in 9-50% of the cases, depending on the number of candidate patches we allow it to generate. Also, the model is able to emulate a variety of different Abstract Syntax Tree operations and generate candidate patches in a split second.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning How to Mutate Source Code from Bug-Fixes

+

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk. 2018

+

+ + + +
+ + repair + + edit + +

+

Mutation testing has been widely accepted as an approach to guide test case generation or to assess the effectiveness of test suites. Empirical studies have shown that mutants are representative of real faults; yet they also indicated a clear need for better, possibly customized, mutation operators and strategies. While some recent papers have tried to devise domain-specific or general purpose mutator operators by manually analyzing real faults, such an activity is effort- (and error-) prone and does not deal with an important practical question as to how to really mutate a given source code element. We propose a novel approach to automatically learn mutants from faults in real programs. First, our approach processes bug fixing changes using fine-grained differencing, code abstraction, and change clustering. Then, it learns mutation models using a deep learning strategy. We have trained and evaluated our technique on a set of ~787k bugs mined from GitHub. Starting from code fixed by developers in the context of a bug-fix, our empirical evaluation showed that our models are able to predict mutants that resemble original fixed bugs in between 9% and 45% of the cases (depending on the model). Moreover, over 98% of the automatically generated mutants are lexically and syntactically correct.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

On Learning Meaningful Code Changes via Neural Machine Translation

+

Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, Denys Poshyvanyk. ICSE 2019

+

+ + + +
+ + repair + + edit + +

+

Recent years have seen the rise of Deep Learning (DL) techniques applied to source code. Researchers have exploited DL to automate several development and maintenance tasks, such as writing commit messages, generating comments and detecting vulnerabilities among others. One of the long lasting dreams of applying DL to code is the possibility to automate non-trivial coding activities. While some steps in this direction have been taken (e.g., learning how to fix bugs), there is still a lack of empirical evidence on the types of code changes that can be learned and automatically applied by DL. Our goal is to make this first step by quantitatively and qualitatively investigating the ability of a Neural Machine Translation (NMT) model to learn how to automatically apply code changes implemented by developers during pull requests. We train and experiment with the NMT model on a set of 236k pairs of code components before and after the implementation of the changes provided in the pull requests. We show that, when applied in a narrow enough context (i.e., small/medium-sized pairs of methods before/after the pull request changes), NMT can automatically replicate the changes implemented by developers during pull requests in up to 36% of the cases. Moreover, our qualitative analysis shows that the model is capable of learning and replicating a wide variety of meaningful code changes, especially refactorings and bug-fixing activities. Our results pave the way to novel research in the area of DL on code, such as the automatic learning and applications of refactoring.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers

+

Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, Neel Sundaresan. ICSE 2020

+

+ + [ArXiV] + + + +
+ + code generation + + synthesis + + test generation + +

+

Unit testing represents the foundational basis of the software testing pyramid, beneath integration and end-to-end testing. Automated software testing researchers have proposed a variety of techniques to assist developers in this time-consuming task. In this paper we present an approach to support developers in writing unit test cases by generating accurate and useful assert statements. Our approach is based on a state-of-the-art transformer model initially pretrained on an English textual corpus. This semantically rich model is then trained in a semi-supervised fashion on a large corpus of source code. Finally, we finetune this model on the task of generating assert statements for unit tests. The resulting model is able to generate accurate assert statements for a given method under test. In our empirical evaluation, the model was able to predict the exact assert statements written by developers in 62% of the cases in the first attempt. The results show 80% relative improvement for top-1 accuracy over the previous RNN-based approach in the literature. We also show the substantial impact of the pretraining process on the performances of our model, as well as comparing it with assert auto-completion task. Finally, we demonstrate how our approach can be used to augment EvoSuite test cases, with additional asserts leading to improved test coverage.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Unit Test Case Generation with Transformers

+

Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, Neel Sundaresan. ICSE 2020

+

+ + [ArXiV] + + + +
+ + code generation + + synthesis + + test generation + +

+

Automated Unit Test Case generation has been the focus of extensive literature within the research community. Existing approaches are usually guided by the test coverage criteria, generating synthetic test cases that are often difficult to read or understand for developers. In this paper we propose AthenaTest, an approach that aims at generating unit test cases by learning from real-world, developer-written test cases. Our approach relies on a state-of-the-art sequence-to-sequence transformer model which is able to write useful test cases for a given method under test (i.e., focal method). We also introduce methods2test - the largest publicly available supervised parallel corpus of unit test case methods and corresponding focal methods in Java, which comprises 630k test cases mined from 70k open-source repositories hosted on GitHub. We use this dataset to train a transformer model to translate focal methods into the corresponding test cases. We evaluate the ability of our model in generating test cases using natural language processing as well as code-specific criteria. First, we assess the quality of the translation compared to the target test case, then we analyze properties of the test case such as syntactic correctness and number and variety of testing APIs (e.g., asserts). We execute the test cases, collect test coverage information, and compare them with test cases generated by EvoSuite and GPT-3. Finally, we survey professional developers on their preference in terms of readability, understandability, and testing effectiveness of the generated test cases.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models

+

Priyan Vaithilingam, Tianyi Zhang, Elena Glassman. CHI 2022

+

+ + [Preprint] + + + +
+ + human evaluation + + code generation + + language model + +

+

Recent advances in Large Language Models (LLM) have made automatic code generation possible for real-world programming tasks in +general-purpose programming languages such as Python. However, +there are few human studies on the usability of these tools and how +they fit the programming workflow. In this work, we conducted +a within-subjects user study with 24 participants to understand +how programmers use and perceive Copilot, a LLM-based code +generation tool. We found that, while Copilot did not necessarily +improve the task completion time or success rate, most participants preferred to use Copilot in daily programming tasks, since +Copilot often provided a useful starting point and saved the effort +of searching online. However, participants did face difficulties in +understanding, editing, and debugging code snippets generated +by Copilot, which significantly hindered their task-solving effectiveness. Finally, we highlighted several promising directions for +improving the design of Copilot based on our observations and +participants’ feedback.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Program Repair by Jointly Learning to Localize and Repair

+

Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, Rishabh Singh. ICLR 2019

+

+ + + +
+ + repair + + program analysis + + variable misuse + +

+

Due to its potential to improve programmer productivity and software quality, automated program repair has been an active topic of research. Newer techniques harness neural networks to learn directly from examples of buggy programs and their fixes. In this work, we consider a recently identified class of bugs called variable-misuse bugs. The state-of-the-art solution for variable misuse enumerates potential fixes for all possible bug locations in a program, before selecting the best prediction. We show that it is beneficial to train a model that jointly and directly localizes and repairs variable-misuse bugs. We present multi-headed pointer networks for this purpose, with one head each for localization and repair. The experimental results show that the joint model significantly outperforms an enumerative solution that uses a pointer based model for repair alone.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Recovering Clear, Natural Identifiers from Obfuscated JS Names

+

Bogdan Vasilescu, Casey Casalnuovo, Premkumar Devanbu. FSE 2017

+

+ + + +
+ + deobfuscation + + naming + +

+

Well-chosen variable names are critical to source code readability, reusability, and maintainability. Unfortunately, in deployed JavaScript code (which is ubiquitous on the web) the identifier names are frequently minified and overloaded. This is done both for efficiency and also to protect potentially proprietary intellectual property. In this paper, we describe an approach based on statistical machine translation (SMT) that recovers some of the original names from the JavaScript programs minified by the very popular UglifyJS. This simple tool, Autonym, performs comparably to the best currently available deobfuscator for JavaScript, JSNice, which uses sophisticated static analysis. In fact, Autonym is quite complementary to JSNice, performing well when it does not, and vice versa. We also introduce a new tool, JSNaughty, which blends Autonym and JSNice, and significantly outperforms both at identifier name recovery, while remaining just as easy to use as JSNice. JSNaughty is available online at http://jsnaughty.org.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

ConTest: A Unit Test Completion Benchmark featuring Context

+

Johannes Villmow, Jonas Depoix, Adrian Ulges. NLP4Prog 2021

+

+ + [PDF] + + + +
+ + benchmark + + dataset + + verification + + Transformer + +

+

We introduce CONTEST, a benchmark for NLP-based unit test completion, the task of predicting a test’s assert statements given its setup and focal method, i.e. the method to be tested. ConTest is large-scale (with 365k datapoints). Besides the test code and tested code, it also features context code called by either. We found context to be crucial for accurately predicting assertions. We also introduce baselines based on transformer encoder-decoders, and study the effects of including syntactic information and context. Overall, our models achieve a BLEU score of 38.2, while only generating unparsable code in 1.92% of cases.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Improving Automatic Source Code Summarization via Deep Reinforcement Learning

+

Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, Philip S. Yu. ASE 2018

+

+ + [ACM] + + + +
+ + summarization + + documentation + +

+

Code summarization provides a high level natural language description of the function performed by code, as it can benefit the software maintenance, code categorization and retrieval. To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework which encodes the code into a hidden space and then decode it into natural language space, suffering from two major drawbacks: a) Their encoders only consider the sequential content of code, ignoring the tree structure which is also critical for the task of code summarization; b) Their decoders are typically trained to predict the next word by maximizing the likelihood of next ground-truth word with previous ground-truth word given. However, it is expected to generate the entire sequence from scratch at test time. This discrepancy can cause an exposure bias issue, making the learnt decoder suboptimal. In this paper, we incorporate an abstract syntax tree structure as well as sequential content of code snippets into a deep reinforcement learning framework (i.e., actor-critic network). The actor network provides the confidence of predicting the next word according to current state. On the other hand, the critic network evaluates the reward value of all possible extensions of the current state and can provide global guidance for explorations. We employ an advantage reward composed of BLEU metric to train both networks. Comprehensive experiments on a real-world dataset show the effectiveness of our proposed model when compared with some state-of-the-art methods.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Multi-Modal Attention Network Learning for Semantic Source Code Retrieval

+

Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, Philip S. Yu. 2019

+

+ + [ArXiV] + + + +
+ + search + +

+

Code retrieval techniques and tools have been playing a key role in facilitating software developers to retrieve existing code fragments from available open-source repositories given a user query. Despite the existing efforts in improving the effectiveness of code retrieval, there are still two main issues hindering them from being used to accurately retrieve satisfiable code fragments from large-scale repositories when answering complicated queries. First, the existing approaches only consider shallow features of source code such as method names and code tokens, but ignoring structured features such as abstract syntax trees (ASTs) and control-flow graphs (CFGs) of source code, which contains rich and well-defined semantics of source code. Second, although the deep learning-based approach performs well on the representation of source code, it lacks the explainability, making it hard to interpret the retrieval results and almost impossible to understand which features of source code contribute more to the final results.

+ +

To tackle the two aforementioned issues, this paper proposes MMAN, a novel Multi-Modal Attention Network for semantic source code retrieval. A comprehensive multi-modal representation is developed for representing unstructured and structured features of source code, with one LSTM for the sequential tokens of code, a Tree-LSTM for the AST of code and a GGNN (Gated Graph Neural Network) for the CFG of code. Furthermore, a multi-modal attention fusion layer is applied to assign weights to different parts of each modality of source code and then integrate them into a single hybrid representation. Comprehensive experiments and analysis on a large-scale real-world dataset show that our proposed model can accurately retrieve code snippets and outperforms the state-of-the-art methods.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

NaturalCC: A Toolkit to Naturalize the Source Code Corpus

+

Yao Wan, Yang He, Jian-Guo Zhang, Yulei Sui, Hai Jin, Guandong Xu, Caiming Xiong, Philip S. Yu. 2020

+

+ + [ArXiV] + + [website] + + [code] + + + +
+ + documentation + + search + + summarization + +

+

We present NaturalCC, an efficient and extensible toolkit to bridge the gap between natural language and programming language, and facilitate the research on big code analysis. Using NaturalCC, researchers both from natural language or programming language communities can quickly and easily reproduce the state-of-the-art baselines and implement their approach. NaturalCC is built upon Fairseq and PyTorch, providing (1) an efficient computation with multi-GPU and mixed-precision data processing for fast model training, (2) a modular and extensible framework that makes it easy to reproduce or implement an approach for big code analysis, and (3) a command line interface and a graphical user interface to demonstrate each model’s performance. Currently, we have included several state-of-the-art baselines across different tasks (e.g., code completion, code comment generation, and code retrieval) for demonstration. The video of this demo is available at https://www.youtube.com/watch?v=q4W5VSI-u3E&t=25s.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code

+

Yao Wan, Wei Zhao, Hongyu Zhang, Yulei Sui, Guandong Xu, Hai Jin. ICSE 2022

+

+ + [ArXiV] + + [Code] + + + +
+ + Transformer + + pretraining + + program analysis + +

+

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and Transformer and have achieved promising results. However, currently there is still little progress regarding interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT, and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-training language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code have the ability of inducing syntax trees of code. Theses findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Automatically Learning Semantic Features for Defect Prediction

+

Song Wang, Taiyue Liu, Lin Tan. ICSE 2016

+

+ + + +
+ + defect + + representation + +

+

Software defect prediction, which predicts defective code regions, can help developers find bugs and prioritize their testing efforts. To build accurate prediction models, previous +studies focus on manually designing features that encode the +characteristics of programs and exploring different machine +learning algorithms. Existing traditional features often fail +to capture the semantic differences of programs, and such a +capability is needed for building accurate prediction models.

+ +

To bridge the gap between programs’ semantics and +defect prediction features, this paper proposes to leverage a +powerful representation-learning algorithm, deep learning, +to learn semantic representation of programs automatically +from source code. Specifically, we leverage Deep Belief +Network (DBN) to automatically learn semantic features +from token vectors extracted from programs’ Abstract +Syntax Trees (ASTs).

+ +

Our evaluation on ten open source projects shows that +our automatically learned semantic features significantly improve both within-project defect prediction (WPDP) and +cross-project defect prediction (CPDP) compared to traditional features. Our semantic features improve WPDP on +average by 14.7% in precision, 11.5% in recall, and 14.2% +in F1. For CPDP, our semantic features based approach +outperforms the state-of-the-art technique TCA+ with traditional features by 8.9% in F1.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Bugram: bug detection with n-gram language models

+

Song Wang, Devin Chollak, Dana Movshovitz-Attias, Lin Tan. ASE 2016

+

+ + + +
+ + defect + + representation + +

+

To improve software reliability, many rule-based techniques have been proposed to infer programming rules and detect violations of these rules as bugs. These rule-based approaches often rely on the highly frequent appearances of certain patterns in a project to infer rules. It is known that if a pattern does not appear frequently enough, rules are not learned, thus missing many bugs.

+ +

In this paper, we propose a new approach—Bugram—that leverages n-gram language models instead of rules to detect bugs. Bugram models program tokens sequentially, using the n-gram language model. Token sequences from the program are then assessed according to their probability in the learned model, and low probability sequences are marked as potential bugs. The assumption is that low probability token sequences in a program are unusual, which may indicate bugs, bad practices, or unusual/special uses of code of which developers may want to be aware.

+ +

We evaluate Bugram in two ways. First, we apply Bugram on the latest versions of 16 open source Java projects. Results show that Bugram detects 59 bugs, 42 of which are manually verified as correct, 25 of which are true bugs and 17 are code snippets that should be refactored. Among the 25 true bugs, 23 cannot be detected by PR-Miner. We have reported these bugs to developers, 7 of which have already been confirmed by developers (4 of them have already been fixed), while the rest await confirmation. Second, we further compare Bugram with three additional graph- and rule-based bug detection tools, i.e., JADET, Tikanga, and GrouMiner. We apply Bugram on 14 Java projects evaluated in these three studies. Bugram detects 21 true bugs, at least 10 of which cannot be detected by these three tools. Our results suggest that Bugram is complementary to existing rule-based bug detection approaches.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Code Completion

+

Chang Liu, Xin Wang, Richard Shin, Joseph E. Gonzalez, Dawn Song. 2016

+

+ + + +
+ + autocomplete + +

+

Code completion, an essential part of modern software development, yet can be +challenging for dynamically typed programming languages. In this paper we explore the use of neural network techniques to automatically learn code completion +from a large corpus of dynamically typed JavaScript code. We show different +neural networks that leverage not only token level information but also structural +information, and evaluate their performance on different prediction tasks. We +demonstrate that our models can outperform the state-of-the-art approach, which +is based on decision tree techniques, on both next non-terminal and next terminal +prediction tasks by 3.8 points and 0.5 points respectively. We believe that neural +network techniques can play a transformative role in helping software developers +manage the growing complexity of software systems, and we see this work as a +first step in that direction.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Scalable and Precise Representation of Program Semantics

+

Ke Wang. 2019

+

+ + + +
+ + representation + + dynamic + +

+

Neural program embedding has shown potential in aiding the analysis of large-scale, complicated software. Newly proposed deep neural architectures pride themselves on learning program semantics rather than superficial syntactic features. However, by considering the source code only, the vast majority of neural networks do not capture a deep, precise representation of program semantics. In this paper, we present \dypro, a novel deep neural network that learns from program execution traces. Compared to the prior dynamic models, not only is \dypro capable of generalizing across multiple executions for learning a program’s dynamic semantics in its entirety, but \dypro is also more efficient when dealing with programs yielding long execution traces. For evaluation, we task \dypro with semantic classification (i.e. categorizing programs based on their semantics) and compared it against two prominent static models: Gated Graph Neural Network and TreeLSTM. We find that \dypro achieves the highest prediction accuracy among all models. To further reveal the capacity of all aforementioned deep neural architectures, we examine if the models can learn to detect deeper semantic properties of a program. In particular given a task of recognizing loop invariants, we show \dypro beats all static models by a wide margin.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Blended, precise semantic program embeddings

+

Ke Wang, Zhendong Su. PLDI 2020

+

+ + + +
+ + dynamic + +

+

Learning neural program embeddings is key to utilizing deep neural networks in program languages research — precise and efficient program representations enable the application of deep models to a wide range of program analysis tasks. Existing approaches predominately learn to embed programs from their source code, and, as a result, they do not capture deep, precise program semantics. On the other hand, models learned from runtime information critically depend on the quality of program executions, thus leading to trained models with highly variant quality. This paper tackles these inherent weaknesses of prior approaches by introducing a new deep neural network, Liger, which learns program representations from a mixture of symbolic and concrete execution traces. We have evaluated Liger on two tasks: method name prediction and semantics classification. Results show that Liger is significantly more accurate than the state-of-the-art static model code2seq in predicting method names, and requires on average around 10x fewer executions covering nearly 4x fewer paths than the state-of-the-art dynamic model DYPRO in both tasks. Liger offers a new, interesting design point in the space of neural program embeddings and opens up this new direction for exploration.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CoCoGUM: Contextual Code Summarization with Multi-Relational GNN on UMLs

+

Yanlin Wang, Lun Du, Ensheng Shi, Yuxuan Hu, Shi Han, Dongmei Zhang. 2020

+

+ + [TR] + + + +
+ + summarization + +

+

Code summaries are short natural language (NL) descriptions of code snippets that help developers better understand and maintain source code. Due to the pivotal role of code summaries in software development and maintenance, there is a surge of works on automatic code summarization to reduce the heavy burdens of developers. However, contemporary approaches only leverage the information within the boundary of the method being summarized (i.e., local context), and ignore that using broader context could assist with code summarization. In this paper, we explore two global context information, namely intra-class and inter-class context information, and propose the model CoCoGUM: Contextual Code Summarization with Multi-Relational Graph Neural Networks on UMLs. CoCoGUM first incorporates class names as the intra-class context, which is further fed to a Transformer-based sentence embedding model to extract the class lexical embeddings. Then, relevant Unified Modeling Language (UML) class diagrams are extracted as inter-class context and we use a Multi-Relational Graph Neural Network (MR-GNN) to encode the class relational embeddings. Class lexical embeddings and class relational embeddings, together with the outputs from code token encoder and AST encoder, are passed to the decoder armed with a two-level attention mechanism to generate high-quality context-aware code summaries. We conduct extensive experiments to evaluate our approach and compare it with other automatic code summarization models. The experimental results show that CoCoGUM outperforms state-of-the-art methods.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree

+

Wenhan Wang, Ge Li, Bo Ma, Xin Xia, Zhi Jin. IEEE International Conference on Software Analysis, Evolution, and Reengineering 2020

+

+ + [ArXiV] + + + +
+ + clone + + GNN + +

+

Code clones are semantically similar code fragments pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches of detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still can not fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. As far as we have concerned, we are the first to apply graph neural networks on the domain of code clone detection. We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Semantic Program Embeddings with Graph Interval Neural Network

+

Yu Wang, Fengjuan Gao, Linzhang Wang, Ke Wang. 2020

+

+ + [ArXiV] + + + +
+ + GNN + + defect + +

+

Learning distributed representations of source code has been a challenging task for machine learning models. Earlier works treated programs as text so that natural language methods can be readily applied. Unfortunately, such approaches do not capitalize on the rich structural information possessed by source code. Of late, Graph Neural Network (GNN) was proposed to learn embeddings of programs from their graph representations. Due to the homogeneous and expensive message-passing procedure, GNN can suffer from precision issues, especially when dealing with programs rendered into large graphs. In this paper, we present a new graph neural architecture, called Graph Interval Neural Network (GINN), to tackle the weaknesses of the existing GNN. Unlike the standard GNN, GINN generalizes from a curated graph representation obtained through an abstraction method designed to aid models to learn. In particular, GINN focuses exclusively on intervals for mining the feature representation of a program, furthermore, GINN operates on a hierarchy of intervals for scaling the learning to large graphs. We evaluate GINN for two popular downstream applications: variable misuse prediction and method name prediction. Results show in both cases GINN outperforms the state-of-the-art models by a comfortable margin. We have also created a neural bug detector based on GINN to catch null pointer deference bugs in Java code. While learning from the same 9,000 methods extracted from 64 projects, GINN-based bug detector significantly outperforms GNN-based bug detector on 13 unseen test projects. Next, we deploy our trained GINN-based bug detector and Facebook Infer to scan the codebase of 20 highly starred projects on GitHub. Through our manual inspection, we confirm 38 bugs out of 102 warnings raised by GINN-based bug detector compared to 34 bugs out of 129 warnings for Facebook Infer.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Represent Programs with Heterogeneous Graphs

+

Wenhan Wang, Kechi Zhang, Ge Li, Zhi Jin. 2020

+

+ + [ArXiV] + + + +
+ + GNN + + summarization + +

+

Program source code contains complex structure information, which can be represented in structured data forms like trees or graphs. To acquire the structural information in source code, most existing researches use abstract syntax trees (AST). A group of works add additional edges to ASTs to convert source code into graphs and use graph neural networks to learn representations for program graphs. Although these works provide additional control or data flow information to ASTs for downstream tasks, they neglect an important aspect of structure information in AST itself: the different types of nodes and edges. In ASTs, different nodes contain different kinds of information like variables or control flow, and the relation between a node and all its children can also be different.

+ +

To address the information of node and edge types, we bring the idea of heterogeneous graphs to learning on source code and present a new formula of building heterogeneous program graphs from ASTs with additional type information for nodes and edges. We use the ASDL grammar of programming language to define the node and edge types of program graphs. Then we use heterogeneous graph neural networks to learn on these graphs. We evaluate our approach on two tasks: code comment generation and method naming. Both tasks require reasoning on the semantics of complete code snippets. Experiment results show that our approach outperforms baseline models, including homogeneous graph-based models, showing that leveraging the type information of nodes and edges in program graphs can help in learning program semantics.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Modular Tree Network for Source Code Representation Learning

+

Wenhan Wang, Ge Li, Sijie Shen, Xin Xia, Zhi Jin. TOSEM 2020

+

+ + [ACM] + + + +
+ + grammar + + representation + +

+

Learning representation for source code is a foundation of many program analysis tasks. In recent years, neural networks have already shown success in this area, but most existing models did not make full use of the unique structural information of programs. Although abstract syntax tree (AST)-based neural models can handle the tree structure in the source code, they cannot capture the richness of different types of substructure in programs. In this article, we propose a modular tree network that dynamically composes different neural network units into tree structures based on the input AST. Different from previous tree-structural neural network models, a modular tree network can capture the semantic differences between types of AST substructures. We evaluate our model on two tasks: program classification and code clone detection. Our model achieves the best performance compared with state-of-the-art approaches in both tasks, showing the advantage of leveraging more elaborate structure information of the source code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search

+

Wenhua Wang, Yuqun Zhang, Zhengran Zeng, Guandong Xu. 2020

+

+ + [ArXiV] + + + +
+ + search + + documentation + +

+

Code summarization and code search have been widely adopted in sofwaredevelopmentandmaintenance. However, fewstudieshave explored the efcacy of unifying them. In this paper, we propose TranS^3 , a transformer-based framework to integrate code summarization with code search. Specifcally, for code summarization,TranS^3 enables an actor-critic network, where in the actor network, we encode the collected code snippets via transformer- and tree-transformer-based encoder and decode the given code snippet to generate its comment. Meanwhile, we iteratively tune the actor network via the feedback from the critic network for enhancing the quality of the generated comments. Furthermore, we import the generated comments to code search for enhancing its accuracy. To evaluatetheefectivenessof TranS^3 , we conduct a set of experimental studies and case studies where the experimental results suggest that TranS^3 can signifcantly outperform multiple state-of-the-art approaches in both code summarization and code search and the study results further strengthen the efcacy of TranS^3 from the developers’ points of view.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

+

Yue Wang, Weishi Wang, Shafiq Joty, Steven C.H. Hoi. EMNLP 2021

+

+ + [ArXiV] + + [Code & Model] + + + +
+ + Transformer + +

+

Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5 .

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation

+

Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, Xin Jiang. 2021

+

+ + [ArXiV] + + + +
+ + pretraining + +

+

Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models for source code (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, code clone detection, and program translation. Current approaches typically consider the source code as a plain sequence of tokens, or inject the structure information (e.g., AST and data-flow) into the sequential model pre-training. To further explore the properties of programming languages, this paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations. Specially, we design two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP), which are designed to predict identifiers, and edges between two nodes of AST, respectively. Meanwhile, to exploit the complementary information in semantically equivalent modalities (i.e., code, comment, AST) of the code, we propose a multi-modal contrastive learning strategy to maximize the mutual information among different modalities. Extensive experiments on four downstream tasks related to code intelligence show that SynCoBERT advances the state-of-the-art with the same pre-training corpus and model size.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

+

Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D. Q. Bui, Junnan Li, Steven C. H. Hoi. 2023

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+’’, a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DeepVD: Toward Class-Separation Features for Neural Network Vulnerability Detection

+

Wenbo Wang, Tien N. Nguyen, Shaohua Wang, Yi Li, Jiyuan Zhang, Aashish Yadavally. ICSE 2023

+

+ + [website] + + [code] + + + +
+ + vulnerability + +

+

The advances of machine learning (ML) including deep learning (DL) have enabled several approaches to implicitly learn vulnerable code patterns to automatically detect software vulnerabilities. A recent study showed that despite successes, the existing ML/DL-based vulnerability detection (VD) models are limited in the ability to distinguish between the two classes of vulnerability and benign code. We propose DeepVD, a graph-based neural network VD model that emphasizes on class-separation features between vulnerability and benign code. DeepVD leverages three types of class-separation features at different levels of abstraction: statement types (similar to Part-of-Speech tagging), Post-Dominator Tree (covering regular flows of execution), and Exception Flow Graph (covering the exception and error-handling flows). We conducted several experiments to evaluate DeepVD in a real-world vulnerability dataset of 303 projects with 13,130 vulnerable methods. Our results show that DeepVD relatively improves over the state-of-the-art ML/DL-based VD approaches 13%–29.6% in precision, 15.6%–28.9% in recall, and 16.4%–25.8% in F-score. Our ablation study confirms that our designed features and components help DeepVD achieve high class-separability for vulnerability and benign code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research

+

Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, Denys Poshyvanyk. TSE 2021

+

+ + [ArXiV] + + [website] + + [code] + + + +
+ + survey + +

+

An increasingly popular set of techniques adopted by software engineering (SE) researchers to automate development tasks are those rooted in the concept of Deep Learning (DL). The popularity of such techniques largely stems from their automated feature engineering capabilities, which aid in modeling software artifacts. However, due to the rapid pace at which DL techniques have been adopted, it is difficult to distill the current successes, failures, and opportunities of the current research landscape. In an effort to bring clarity to this crosscutting area of work, from its modern inception to the present, this paper presents a systematic literature review of research at the intersection of SE & DL. The review canvases work appearing in the most prominent SE and DL conferences and journals and spans 128 papers across 23 unique SE tasks. We center our analysis around the components of learning, a set of principles that govern the application of machine learning techniques (ML) to a given problem domain, discussing several aspects of the surveyed work at a granular level. The end result of our analysis is a research roadmap that both delineates the foundations of DL techniques applied to SE research, and highlights likely areas of fertile exploration for the future.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

IdBench: Evaluating Semantic Representations of Identifier Names in Source Code

+

Yaza Wainakh, Moiz Rauf, Michael Pradel. ICSE 2021

+

+ + [ArXiV] + + + +
+ + representation + +

+

Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of namebased analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Code Generation as a Dual Task of Code Summarization

+

Bolin Wei, Ge Li, Xin Xia, Zhiyi Fu, Zhi Jin. NeurIPS 2019

+

+ + [ArXiV] + + + +
+ + code generation + + summarization + +

+

Code summarization (CS) and code generation (CG) are two crucial tasks in the field of automatic software development. Various neural network-based approaches are proposed to solve these two tasks separately. However, there exists a specific intuitive correlation between CS and CG, which have not been exploited in previous work. In this paper, we apply the relations between two tasks to improve the performance of both tasks. In other words, exploiting the duality between the two tasks, we propose a dual training framework to train the two tasks simultaneously. In this framework, we consider the dualities on probability and attention weights, and design corresponding regularization terms to constrain the duality. We evaluate our approach on two datasets collected from GitHub, and experimental results show that our dual framework can improve the performance of CS and CG tasks over baselines.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

LambdaNet: Probabilistic Type Inference using Graph Neural Networks

+

Jiayi Wei, Maruth Goyal, Greg Durrett, Isil Dillig. ICLR 2020

+

+ + [OpenReview] + + [ArXiV] + + [Code] + + + +
+ + GNN + + types + +

+

As gradual typing becomes increasingly popular in languages like Python and TypeScript, there is a growing need to infer type annotations automatically. While type annotations help with tasks like code completion and static error catching, these annotations cannot be fully inferred by compilers and are tedious to annotate by hand. This paper proposes a probabilistic type inference scheme for TypeScript based on a graph neural network. Our approach first uses lightweight source code analysis to generate a program abstraction called a type dependency graph, which links type variables with logical constraints as well as name and usage information. Given this program abstraction, we then use a graph neural network to propagate information between related type variables and eventually make type predictions. Our neural architecture can predict both standard types, like number or string, as well as user-defined types that have not been encountered during training. Our experimental results show that our approach outperforms prior work in this space by 14% (absolute) on library types, while having the ability to make type predictions that are out of scope for existing techniques.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

TypeT5: Seq2seq Type Inference using Static Analysis

+

Jiayi Wei, Greg Durrett, Isil Dillig. ICLR 2023

+

+ + [ArXiV] + + + +
+ + types + + Transformer + +

+

There has been growing interest in automatically predicting missing type annotations in programs written in Python and JavaScript. While prior methods have achieved impressive accuracy when predicting the most common types, they often perform poorly on rare or complex types. In this paper, we present a new type inference method that treats type prediction as a code infilling task by leveraging CodeT5, a state-of-the-art seq2seq pre-trained language model for code. Our method uses static analysis to construct dynamic contexts for each code element whose type signature is to be predicted by the model. We also propose an iterative decoding scheme that incorporates previous type predictions in the model’s input context, allowing information exchange between related code elements. Our evaluation shows that the proposed approach, TypeT5, not only achieves a higher overall accuracy (particularly on rare and complex types) but also produces more coherent results with fewer type errors – while enabling easy user intervention.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Toward Deep Learning Software Repositories

+

Martin White, Christopher Vendome, Mario Linares-Vasquez, Denys Poshyvanyk. MSR 2015

+

+ + + +
+ + representation + +

+

Deep learning subsumes algorithms that automatically learn compositional representations. The ability of these +models to generalize well has ushered in tremendous advances +in many fields such as natural language processing (NLP). +Recent research in the software engineering (SE) community +has demonstrated the usefulness of applying NLP techniques to +software corpora. Hence, we motivate deep learning for software +language modeling, highlighting fundamental differences between +state-of-the-practice software language models and connectionist +models. Our deep learning models are applicable to source +code files (since they only require lexically analyzed source +code written in any programming language) and other types +of artifacts. We show how a particular deep learning model +can remember its state to effectively model sequential data, +e.g., streaming software tokens, and the state is shown to be +much more expressive than discrete tokens in a prefix. Then we +instantiate deep learning models and show that deep learning +induces high-quality models compared to n-grams and cache-based n-grams on a corpus of Java projects. We experiment +with two of the models’ hyperparameters, which govern their +capacity and the amount of context they use to inform predictions, +before building several committees of software language models +to aid generalization. Then we apply the deep learning models to +code suggestion and demonstrate their effectiveness at a real SE +task compared to state-of-the-practice models. Finally, we propose +avenues for future work, where deep learning can be brought to +bear to support model-based testing, improve software lexicons, +and conceptualize software artifacts. Thus, our work serves as +the first step toward deep learning software repositories.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Deep Learning Code Fragments for Code Clone Detection

+

Martin White, Michele Tufano, Christopher Vendome, Denys Poshyvanyk.. ASE 2016

+

+ + + +
+ + clone + +

+

Code clone detection is an important problem for software +maintenance and evolution. Many approaches consider either structure or identifiers, but none of the existing detection techniques model both sources of information. These +techniques also depend on generic, handcrafted features to +represent code fragments. We introduce learning-based detection techniques where everything for representing terms +and fragments in source code is mined from the repository. +Our code analysis supports a framework, which relies on +deep learning, for automatically linking patterns mined at +the lexical level with patterns mined at the syntactic level. +We evaluated our novel learning-based approach for code +clone detection with respect to feasibility from the point +of view of software maintainers. We sampled and manually +evaluated 398 file- and 480 method-level pairs across eight +real-world Java systems; 93% of the file- and method-level +samples were evaluated to be true positives. Among the true +positives, we found pairs mapping to all four clone types. We +compared our approach to a traditional structure-oriented +technique and found that our learning-based approach detected clones that were either undetected or suboptimally +reported by the prominent tool Deckard. Our results affirm +that our learning-based approach is suitable for clone detection and a tenable technique for researchers.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities

+

Martin White, Michele Tufano, Matias Martinez, Martin Monperrus, Denys Poshyvanyk. SANER 2017

+

+ + + +
+ + repair + +

+

In the field of automated program repair, the redundancy assumption claims large programs contain the seeds +of their own repair. However, most redundancy-based program +repair techniques do not reason about the repair ingredients—the code that is reused to craft a patch. We aim to reason about +the repair ingredients by using code similarities to prioritize and +transform statements in a codebase for patch generation. Our +approach, DeepRepair, relies on deep learning to reason about +code similarities. Code fragments at well-defined levels of granularity in a codebase can be sorted according to their similarity +to suspicious elements (i.e., code elements that contain suspicious +statements) and statements can be transformed by mapping out-of-scope identifiers to similar identifiers in scope. We examined +these new search strategies for patch generation with respect to +effectiveness from the viewpoint of a software maintainer. Our +comparative experiments were executed on six open-source Java +projects including 374 buggy program revisions and consisted +of 19,949 trials spanning 2,616 days of computation time. DeepRepair’s search strategy using code similarities generally found +compilable ingredients faster than the baseline, jGenProg, but +this improvement neither yielded test-adequate patches in fewer +attempts (on average) nor found significantly more patches than +the baseline. Although the patch counts were not statistically +different, there were notable differences between the nature of +DeepRepair patches and baseline patches. The results demonstrate that our learning-based approach finds patches that cannot +be found by existing redundancy-based repair techniques

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Leveraging Language to Learn Program Abstractions and Search Heuristics

+

Catherine Wong, Kevin Ellis, Joshua B. Tenenbaum, Jacob Andreas. Thirty-eighth International Conference on Machine Learning (ICML 2021) 2021

+

+ + [ArXiV] + + [Poster] + + + +
+ + synthesis + + search + +

+

Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains – string editing, image composition, and abstract reasoning about scenes – even when no natural language hints are available at test time.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback

+

Mike Wu, Noah D. Goodman, Chris Piech, Chelsea Finn. 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + education + +

+

High-quality computer science education is limited by the difficulty of providing instructor feedback to students at scale. While this feedback could in principle be automated, supervised approaches to predicting the correct feedback are bottlenecked by the intractability of annotating large quantities of student code. In this paper, we instead frame the problem of providing feedback as few-shot classification, where a meta-learner adapts to give feedback to student code on a new programming question from just a few examples annotated by instructors. Because data for meta-training is limited, we propose a number of amendments to the typical few-shot learning framework, including task augmentation to create synthetic tasks, and additional side information to build stronger priors about each task. These additions are combined with a transformer architecture to embed discrete sequences (e.g. code) to a prototypical representation of a feedback class label. On a suite of few-shot natural language processing tasks, we match or outperform state-of-the-art performance. Then, on a collection of student solutions to exam questions from an introductory university course, we show that our approach reaches an average precision of 88% on unseen questions, surpassing the 82% precision of teaching assistants. Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university. This is, to the best of our knowledge, the first successful deployment of a machine learning based feedback to open-ended student code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Universal Fuzzing via Large Language Models

+

Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, Lingming Zhang. 2023

+

+ + [ArXiV] + + + +
+ + fuzzing + +

+

Fuzzing has achieved tremendous success in discovering bugs and vulnerabilities in various software systems. Systems under test (SUTs) that take in programming or formal language as inputs, e.g., compilers, runtime engines, constraint solvers, and software libraries with accessible APIs, are especially important as they are fundamental building blocks of software development. However, existing fuzzers for such systems often target a specific language, and thus cannot be easily applied to other languages or even other versions of the same language. Moreover, the inputs generated by existing fuzzers are often limited to specific features of the input language, and thus can hardly reveal bugs related to other or new features. This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are wellsuited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 76 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 47 bugs already confirmed by developers as previously unknown.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Commit Message Generation for Source Code Changes

+

Shengbin Xu, Yuan Yao, Feng Xu, Tianxiao Gu, Hanghang Tong, Jian Lu. IJCAI 2019

+

+ + + +
+ + edit + + summarization + +

+

Commit messages, which summarize the source +code changes in natural language, are essential for +program comprehension and software evolution understanding. Unfortunately, due to the lack of direct +motivation, commit messages are sometimes neglected by developers, making it necessary to +automatically generate such messages. State-of-the-art adopts learning based approaches such as +neural machine translation models for the commitmessage generation problem. However, they tend +to ignore the code structure information and suffer from the out-of-vocabulary issue. +In this paper, we propose CODISUM to address the above two limitations. In particular, +we first extract both code structure and code semantics from the source code changes, and then +jointly model these two sources of information so as to better learn the representations + of the code changes. Moreover, we augment the model with copying mechanism to further +mitigate the out-of-vocabulary issue. Experimental evaluations on real data demonstrate that +the proposed approach significantly outperforms the state-of-the-art in terms of accurately generating the commit messages.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Method name suggestion with hierarchical attention networks

+

Sihan Xu, Sen Zhang, Weijing Wang, Xinya Cao, Chenkai Guo, Jing Xu.. PEPM 2019

+

+ + + +
+ + naming + +

+

Method Rename has been a widely used refactoring operation that improves program comprehension and maintenance. Descriptive method names that summarize functionalities of source code can facilitate program comprehension. Much research has been done to suggest method names through source code summarization. However, unlike natural language, a code snippet consists of basic blocks organized by complicated structures. In this work, we observe a hierarchical structure — tokens form basic blocks and basic blocks form a code snippet. Based on this observation, we exploit a hierarchical attention network to learn the representation of methods. Specifically, we apply two-level attention mechanism to learn the importance of each token in a basic block and that of a basic block in a method respectively. We evaluated our approach on 10 open source repositories and compared it against three state-of-the-art approaches. The results on these open-source data show the superiority of our hierarchical attention networks in terms of effectiveness.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

+

Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, Graham Neubig. ACL 2020

+

+ + [ArXiV] + + [Code] + + + +
+ + bimodal + + code generation + +

+

Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at [Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at https://github.com/neulab/external-knowledge-codegen.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Capturing Structural Locality in Non-parametric Language Models

+

Frank F. Xu, Junxian He, Graham Neubig, Vincent J. Hellendoorn. 2021

+

+ + [ArXiV] + + + +
+ + language model + +

+

Structural locality is a ubiquitous feature of real-world datasets, wherein data points are organized into local hierarchies. Some examples include topical clusters in text or project hierarchies in source code repositories. In this paper, we explore utilizing this structural locality within non-parametric language models, which generate sequences that reference retrieved examples from an external source. We propose a simple yet effective approach for adding locality information into such models by adding learned parameters that improve the likelihood of retrieving examples from local neighborhoods. Experiments on two different domains, Java source code and Wikipedia text, demonstrate that locality features improve model efficacy over models without access to these features, with interesting differences. We also perform an analysis of how and where locality features contribute to improved performance and why the traditionally used contextual similarity metrics alone are not enough to grasp the locality structure.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Systematic Evaluation of Large Language Models of Code

+

Frank F. Xu, Uri Alon, Graham Neubig, Vincent J. Hellendoorn. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + language model + +

+

Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, although targeted mainly for natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available at this https URL, which enables future research and application in this area.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

(Partial) Program Dependence Learning

+

Aashish Yadavally, Wenbo Wang, Shaohua Wang, Tien N. Nguyen. ICSE 2023

+

+ + [website] + + [code] + + + +
+ + large language models + + program analysis + + static analysis + + tool + +

+

Code fragments from developer forums often migrate to applications due to the code reuse practice. Owing to the incomplete nature of such programs, analyzing them to early determine the presence of potential vulnerabilities is challenging. In this work, we introduce NeuralPDA, a neural network-based program dependence analysis tool for both complete and partial programs. Our tool efficiently incorporates intra-statement and inter-statement contextual features into statement representations, thereby modeling program dependence analysis as a statement-pair dependence decoding task. In the empirical evaluation, we report that NeuralPDA predicts the CFG and PDG edges in complete Java and C/C++ code with combined F-scores of 94.29% and 92.46%, respectively. The F-score values for partial Java and C/C++ code range from 94.29%–97.17% and 92.46%–96.01%, respectively. We also test the usefulness of the PDGs predicted by NEURALPDA (i.e., PDG) on the downstream task of method-level vulnerability detection. We discover that the performance of the vulnerability detection tool utilizing PDG is only 1.1% less than that utilizing the PDGs generated by a program analysis tool. We also report the detection of 14 real-world vulnerable code snippets from StackOverflow by a machine learning-based vulnerability detection tool that employs the PDGs predicted by NeuralPDA for these code snippets.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Learning-Based Approach to Static Program Slicing

+

Aashish Yadavally, Yi Li, Shaohua Wang, Tien N. Nguyen. OOPSLA 2024

+

+ + [website] + + [code] + + + +
+ + large language models + + program analysis + + static + + tool + +

+

Traditional program slicing techniques are crucial for early bug detection and manual/automated debugging of online code snippets. Nevertheless, their inability to handle incomplete code hinders their real-world applicability in such scenarios. To overcome these challenges, we present NS-Slicer, a novel learning-based approach that predicts static program slices for both complete and partial code. Our tool leverages a pre-trained language model to exploit its understanding of fine-grained variable-statement dependencies within source code. With this knowledge, given a variable at a specific location and a statement in a code snippet, NS-Slicer determines whether the statement belongs to the backward slice or forward slice, respectively. We conducted a series of experiments to evaluate NS-Slicer’s performance. On complete code, it predicts the backward and forward slices with an F1-score of 97.41% and 95.82%, respectively, while achieving an overall F1-score of 96.77%. Notably, in 85.20% of the cases, the static program slices predicted by NS-Slicer exactly match entire slices from the oracle. For partial programs, it achieved an F1-score of 96.77%–97.49% for backward slicing, 92.14%–95.40% for forward slicing, and an overall F1-score of 94.66%–96.62%. Furthermore, we demonstrate NS-Slicer’s utility in vulnerability detection (VD), integrating its predicted slices into an automated VD tool. In this setup, the tool detected vulnerabilities in Java code with a high F1-score of 73.38%. We also include the analyses studying NS-Slicer’s promising performance and limitations, providing insights into its understanding of intrinsic code properties such as variable aliasing, leading to better slicing.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Predictive Program Slicing via Execution Knowledge-Guided Dynamic Dependence Learning

+

Aashish Yadavally, Yi Li, Tien N. Nguyen. FSE 2024

+

+ + [website] + + [code] + + + +
+ + large language models + + program analysis + + dynamic + + tool + +

+

Program slicing, the process of extracting program statements that influence values at a designated location (known as the slicing criterion), is helpful in both manual and automated debugging. However, such slicing techniques prove ineffective in scenarios where executing specific inputs is prohibitively expensive, or even impossible, as with partial code. In this paper, we introduce ND-Slicer, a predictive slicing methodology that caters to specific executions based on a particular input, overcoming the need for actual execution. We enable such a process by leveraging execution-aware pre-training to learn the dynamic program dependencies, including both dynamic data and control dependencies between variables in the slicing criterion and the remaining program statements. Such knowledge forms the cornerstone for constructing a predictive backward slice. Our empirical evaluation revealed a high accuracy in predicting program slices, achieving an exact-match accuracy of 81.3% and a ROUGE-LCS F1-score of 95.4% on Python programs. As an extrinsic evaluation, we illustrate ND-Slicer’s usefulness in crash detection, with it locating faults with an accuracy of 63.9%. Furthermore, we include an in-depth qualitative evaluation, assessing ND-Slicer’s understanding of branched structures such as if-else blocks and loops, as well as the control flow in inter-procedural calls.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Extracting Code from Programming Tutorial Videos

+

Shir Yadid, Eran Yahav. Onward! 2016

+

+ + + +
+ + information extraction + +

+

The number of programming tutorial videos on the web +increases daily. Video hosting sites such as YouTube host +millions of video lectures, with many programming tutorials for various languages and platforms. These videos contain a wealth of valuable information, including code that +may be of interest. However, two main challenges have so +far prevented the effective indexing of programming tutorial +videos: (i) code in tutorials is typically written on-the-fly, +with only parts of the code visible in each frame, and (ii) optical character recognition (OCR) is not precise enough to +produce quality results from videos.

+ +

We present a novel approach for extracting code from +videos that is based on: (i) consolidating code across frames, +and (ii) statistical language models for applying corrections +at different levels, allowing us to make corrections by choosing the most likely token, combination of tokens that form a +likely line structure, and combination of lines that lead to +a likely code fragment in a particular language. We implemented our approach in a tool called ACE , and used it to extract code from 40 Android video tutorials on YouTube . Our +evaluation shows that ACE extracts code with high accuracy, +enabling deep indexing of video tutorials.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries

+

Shuhan Yan, Hang Yu, Yuting Chen, Beijun Shen, Lingxiao Jiang. SANER 2020

+

+ + [IEEE] + + + +
+ + search + +

+

Code search methods, especially those that allow programmers to raise queries in a natural language, plays an important role in software development. It helps to improve programmers’ productivity by returning sample code snippets from the Internet and/or source-code repositories for their natural-language queries. Meanwhile, there are many code search methods in the literature that support natural-language queries. Difficulties exist in recognizing the strengths and weaknesses of each method and choosing the right one for different usage scenarios, because (1) the implementations of those methods and the datasets for evaluating them are usually not publicly available, and (2) some methods leverage different training datasets or auxiliary data sources and thus their effectiveness cannot be fairly measured and may be negatively affected in practical uses. To build a common ground for measuring code search methods, this paper builds CosBench, a dataset that consists of 1000 projects, 52 code-independent natural-language queries with ground truths, and a set of scripts for calculating four metrics on code research results. We have evaluated four IR (Information Retrieval)-based and two DL (Deep Learning)-based code search methods on CosBench. The empirical evaluation results clearly show the usefulness of the CosBench dataset and various strengths of each code search method. We found that DL-based methods are more suitable for queries on reusing code, and IR-based ones for queries on resolving bugs and learning API uses.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Language Model for Statements of Software Code

+

Yixiao Yang, Yu Jiang, Ming Gu, Jiaguang Sun, Jian Gao, Han Liu. ASE 2017

+

+ + [ACM] + + + +
+ + language model + +

+

Building language models for source code enables a large set of improvements on traditional software engineering tasks. One promising application is automatic code completion. State-of-the-art techniques capture code regularities at token level with lexical information. Such language models are more suitable for predicting short token sequences, but become less effective with respect to long statement level predictions. In this paper, we have proposed PCC to optimize the token level based language modeling. Specifically, PCC introduced an intermediate representation (IR) for source code, which puts tokens into groups using lexeme and variable relative order. In this way, PCC is able to handle long token sequences, i.e., group sequences, to suggest a complete statement with the precise synthesizer. Further more, PCC employed a fuzzy matching technique which combined genetic and longest common sub-sequence algorithms to make the prediction more accurate. We have implemented a code completion plugin for Eclipse and evaluated it on open-source Java projects. The results have demonstrated the potential of PCC in generating precise long statement level predictions. In 30%-60% of the cases, it can correctly suggest the complete statement with only six candidates, and 40%-90% of the cases with ten candidates.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Survey on Deep Learning for Software Engineering

+

Yanming Yang, Xin Xia, David Lo, John Grundy. 2020

+

+ + [ArXiV] + + + +
+ + survey + +

+

In 2006, Geoffrey Hinton proposed the concept of training ‘‘Deep Neural Networks (DNNs)’’ and an improved model training method to break the bottleneck of neural network development. More recently, the introduction of AlphaGo in 2016 demonstrated the powerful learning ability of deep learning and its enormous potential. Deep learning has been increasingly used to develop state-of-the-art software engineering (SE) research tools due to its ability to boost performance for various SE tasks. There are many factors, e.g., deep learning model selection, internal structure differences, and model optimization techniques, that may have an impact on the performance of DNNs applied in SE. Few works to date focus on summarizing, classifying, and analyzing the application of deep learning techniques in SE. To fill this gap, we performed a survey to analyse the relevant studies published since 2006. We first provide an example to illustrate how deep learning techniques are used in SE. We then summarize and classify different deep learning techniques used in SE. We analyzed key optimization technologies used in these deep learning models, and finally describe a range of key research topics using DNNs in SE. Based on our findings, we present a set of current challenges remaining to be investigated and outline a proposed research road map highlighting key opportunities for future work.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

+

Ziyu Yao, Daniel S. Weld, Wei-Peng Chen, Huan Sun. WWW 2018 2018

+

+ + [ArXiV] + + [code] + + + +
+ + dataset + +

+

Stack Overflow (SO) has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including code retrieval and annotation. In most existing research, question-code pairs were collected heuristically and tend to have low quality. In this paper, we investigate a new problem of systematically mining question-code pairs from Stack Overflow (in contrast to heuristically collecting them). It is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-View Hierarchical Neural Network which can capture both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets in Python and SQL domain, our framework substantially outperforms heuristic methods with at least 15% higher F1 and accuracy. Furthermore, we present StaQC (Stack Overflow Question-Code pairs), the largest dataset to date of ∼148K Python and ∼120K SQL question-code pairs, automatically mined from SO using our framework. Under various case studies, we demonstrate that StaQC can greatly help develop data-hungry models for associating natural language with programming language.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning

+

Ziyu Yao, Jayavardhan Reddy Peddamail, Huan Sun. 2019

+

+ + + +
+ + search + +

+

To accelerate software development, much research has been performed +to help people understand and reuse the huge amount of available code +resources. Two important tasks have been widely studied: code retrieval, +which aims to retrieve code snippets relevant to a given natural language +query from a code base, and code annotation, where the goal is to annotate a +code snippet with anatural language description. Despite their advancement in recent +years, the two tasks are mostly explored separately. In this work, we +investigate a novel perspective of Code annotation for Code retrieval +(hence called “CoaCor”), where a code annotation model is trained +to generate a natural language annotation that can represent the +semantic meaning of a given code snippet and can be leveraged by +a code retrieval model to better distinguish relevant code snippets +from others. To this end, we propose an effective framework based +on reinforcement learning, which explicitly encourages the code +annotation model to generate annotations that can be used for the +retrieval task. Through extensive experiments, we show that code +annotations generated by our framework are much more detailed +and more useful for code retrieval, and they can further improve +the performance of existing code retrieval models significantly.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Graph-based, Self-Supervised Program Repair from Diagnostic Feedback

+

Michihiro Yasunaga, Percy Liang. 2020

+

+ + [ArXiV] + + + +
+ + repair + + edit + + GNN + +

+

We consider the problem of learning to repair programs from diagnostic feedback (e.g., compiler error messages). Program repair is challenging for two reasons: First, it requires reasoning and tracking symbols across source code and diagnostic feedback. Second, labeled datasets available for program repair are relatively small. In this work, we propose novel solutions to these two challenges. First, we introduce a program-feedback graph, which connects symbols relevant to program repair in source code and diagnostic feedback, and then apply a graph neural network on top to model the reasoning process. Second, we present a self-supervised learning paradigm for program repair that leverages unlabeled programs available online to create a large amount of extra program repair examples, which we use to pre-train our models. We evaluate our proposed approach on two applications: correcting introductory programming assignments (DeepFix dataset) and correcting the outputs of program synthesis (SPoC dataset). Our final system, DrRepair, significantly outperforms prior work, achieving 66.1% full repair rate on DeepFix (+20.8% over the prior best), and 48.0% synthesis success rate on SPoC (+3.3% over the prior best).

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning

+

Wei Ye, Rui Xie, Jinglei Zhang, Tianxiang Hu, Xiaoyin Wang, Shikun Zhang. WWW 2020

+

+ + [ArXiV] + + + +
+ + search + + summarization + +

+

Code summarization generates brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query. Since both tasks aim to model the association between natural language and programming language, recent studies have combined these two tasks to improve their performance. However, researchers have yet been able to effectively leverage the intrinsic connection between the two tasks as they train these tasks in a separate or pipeline manner, which means their performance can not be well balanced. In this paper, we propose a novel end-to-end model for the two tasks by introducing an additional code generation task. More specifically, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning, and utilize the two encoders for code summarization and code generation to train the code retrieval task via multi-task learning. We have carried out extensive experiments on an existing dataset of SQL and Python, and results show that our model can significantly improve the results of the code retrieval task over the-state-of-art models, as well as achieve competitive performance in terms of BLEU score for the code summarization task.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

MISIM: An End-to-End Neural Code Similarity System

+

Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Nesime Tatbul, Jesmin Jahan Tithi, Paul Petersen, Timothy Mattson, Tim Kraska, Pradeep Dubey, Vivek Sarkar, Justin Gottschlich. 2020

+

+ + [ArXiV] + + + +
+ + code similarity + +

+

Code similarity systems are integral to a range of applications from code recommendation to automated construction of software tests and defect mitigation. In this paper, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters. We compare MISIM to three other state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 45,780 programs, MISIM consistently outperformed all three systems, often by a large factor (upwards of 40.6x).

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Program Repair with Execution-based Backpropagation

+

He Ye, Matias Martinez, Monperrus Martin. 2021

+

+ + [ArXiV] + + + +
+ + repair + +

+

Neural machine translation (NMT) architectures have achieved promising results for automatic program repair. Yet, they have the limitation of generating low-quality patches (e.g., not compilable patches). This is because the existing works only optimize a purely syntactic loss function based on characters and tokens without incorporating program-specific information during neural net weight optimization. In this paper, we propose a novel program repair model called RewardRepair. The core novelty of RewardRepair is to improve NMT-based program repair with a loss function based on program compilation and test execution information, rewarding the network to produce patches that compile and that do not overfit. We conduct several experiments to evaluate RewardRepair showing that it is feasible and effective to use compilation and test execution results to optimize the underlying neural repair model. In total, RewardRepair correctly repairs 43 Defects4J bugs including eight that are fixed for the first time.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

SelfAPR: Self-supervised Program Repair with Test Execution Diagnostics

+

He Ye, Matias Martinez, Xiapu Luo, Tao Zhang, Martin Monperrus. 2022

+

+ + [ArXiV] + + + +
+ + repair + + execution + +

+

Neural program repair has achieved good results in a recent series of papers. Yet, we observe that the related work fails to repair some bugs because of a lack of knowledge about 1) the program being repaired, and 2) the actual fault being repaired. In this paper, we solve both problems by changing the learning paradigm from supervised training to self-supervised training in an approach called SelfAPR. First, SelfAPR generates and constructs training samples by perturbing a previous version of the program being repaired, enforcing the neural model to capture project-specific knowledge. This is different from all the existing work based on past commits. Second, SelfAPR extracts and encodes test execution diagnostics into the input representation, steering the neural model to fix the specific kind of fault. This is different from the existing studies that only consider static source code in the input. We implement SelfAPR and evaluate it in a systematic manner. We train SelfAPR with 253 411 training samples obtained by perturbing 17 open-source projects. We evaluate SelfAPR on 818 bugs from Defects4J, SelfAPR correctly repairs 112 of them.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Adversarial Examples for Models of Code

+

Noam Yefet, Uri Alon, Eran Yahav. 2019

+

+ + [ArXiV] + + + +
+ + adversarial + +

+

Neural models of code have shown impressive performance for tasks such as predicting method names and identifying certain kinds of bugs. In this paper, we show that these models are vulnerable to adversarial examples, and introduce a novel approach for attacking trained models of code with adversarial examples. The main idea is to force a given trained model to make an incorrect prediction as specified by the adversary by introducing small perturbations that do not change the program’s semantics. To find such perturbations, we present a new technique for Discrete Adversarial Manipulation of Programs (DAMP). DAMP works by deriving the desired prediction with respect to the model’s inputs while holding the model weights constant and following the gradients to slightly modify the code.

+ +

To defend a model against such attacks, we propose placing a defensive model (Anti-DAMP) in front of it. Anti-DAMP detects unlikely mutations and masks them before feeding the input to the downstream model.

+ +

We show that our DAMP attack is effective across three neural architectures: code2vec, GGNN, and GNN-FiLM, in both Java and C#. We show that DAMP has up to 89% success rate in changing a prediction to the adversary’s choice (“targeted attack”), and a success rate of up to 94% in changing a given prediction to any incorrect prediction (“non-targeted attack”). By using Anti-DAMP, the success rate of the attack drops drastically for both targeted and non-targeted attacks, with a minor penalty of 2% relative degradation in accuracy while not performing under attack.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Syntactic Neural Model for General-Purpose Code Generation

+

Pengcheng Yin, Graham Neubig. ACL 2017

+

+ + + +
+ + code generation + + grammar + + bimodal + +

+

We consider the problem of parsing natural language descriptions into source code +written in a general-purpose programming +language like Python. Existing data-driven methods treat this problem as a language generation task without considering +the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture +powered by a grammar model to explicitly +capture the target syntax as prior knowledge. Experiments find this an effective +way to scale up to generation of complex +programs from natural language descriptions, achieving state-of-the-art results that +well outperform previous code generation +and semantic parsing approaches.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

+

Pengcheng Yin, B. Deng, E. Chen, B. Vasilescu, Graham Neubig. MSR 2018

+

+ + [data] + + + +
+ + dataset + +

+

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models require parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Represent Edits

+

Pengcheng Yin, Graham Neubig, Miltiadis Allamanis, Marc Brockschmidt, Alexander L. Gaunt. ICLR 2019

+

+ + [ArXiV] + + [data extraction] + + [code edit data] + + + +
+ + edit + +

+

We introduce the problem of learning distributed representations of edits. By combining a +“neural editor” with an “edit encoder”, our models learn to represent the salient +information of an edit and can be used to apply edits to new inputs. +We experiment on natural language and source code edit data. Our evaluation yields +promising results that suggest that our neural network models learn to capture +the structure and semantics of edits. We hope that this interesting task and +data source will inspire other researchers to work further on this problem.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Natural Language to Code Generation in Interactive Data Science Notebooks

+

Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, Charles Sutton. 2022

+

+ + [ArXiV] + + + +
+ + notebook + + evaluation + +

+

Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Mercem: Method Name Recommendation Based on Call Graph Embedding

+

Hiroshi Yonai, Yasuhiro Hayase, Hiroyuki Kitagawa. 2019

+

+ + [ArXiV] + + + +
+ + naming + + representation + + refactoring + +

+

Comprehensibility of source code is strongly affected by identifier names, therefore software developers need to give good (e.g. meaningful but short) names to identifiers. On the other hand, giving a good name is sometimes a difficult and time-consuming task even for experienced developers. To support naming identifiers, several techniques for recommending identifier name candidates have been proposed. These techniques, however, still have challenges on the goodness of suggested candidates and limitations on applicable situations. This paper proposes a new approach to recommending method names by applying graph embedding techniques to the method call graph. The evaluation experiment confirms that the proposed technique can suggest more appropriate method name candidates in difficult situations than the state of the art approach.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Abridging Source Code

+

Binhang Yuan, Vijayaraghavan Murali, Christopher Jermaine. OOPSLA 2017

+

+ + [ACM] + + + +
+ + summarization + +

+

In this paper, we consider the problem of source code abridgment, where the goal is to remove statements from a source code in order to display the source code in a small space, while at the same time leaving the ``important’’ parts of the source code intact, so that an engineer can read the code and quickly understand purpose of the code. To this end, we develop an algorithm that looks at a number of examples, human-created source code abridgments, and learns how to remove lines from the code in order to mimic the human abridger. The learning algorithm takes into account syntactic features of the code, as well as semantic features such as control flow and data dependencies. Through a comprehensive user study, we show that the abridgments that our system produces can decrease the time that a user must look at code in order to understand its functionality, as well as increase the accuracy of the assessment, while displaying the code in a greatly reduced area.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning to Execute

+

Wojciech Zaremba, Ilya Sutskever. 2014

+

+ + [ArXiV] + + + +
+ + execution + + representation + +

+

Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks. We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks’ performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an LSTM to add two 9-digit numbers with 99% accuracy.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

An Extensive Study on Pre-trained Models for Program Understanding and Generation

+

Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, Lingming Zhang. ISSTA 2022

+

+ + [Author Version] + + + +
+ + Transformer + + evaluation + +

+

Automatic program understanding and generation techniques could +significantly advance the productivity of programmers and have +been widely studied by academia and industry. Recently, the advent of pre-trained paradigm enlightens researchers to develop +general-purpose pre-trained models which can be applied for a +broad range of program understanding and generation tasks. Such +pre-trained models, derived by self-supervised objectives on large +unlabelled corpora, can be fine-tuned in downstream tasks (such +as code search and code generation) with minimal adaptations. Although these pre-trained models claim superiority over the prior +techniques, they seldom follow equivalent evaluation protocols, e.g., +they are hardly evaluated on the identical benchmarks, tasks, or settings. Consequently, there is a pressing need for a comprehensive +study of the pre-trained models on their effectiveness, versatility +as well as the limitations to provide implications and guidance for +the future development in this area. To this end, we first perform +an extensive study of eight open-access pre-trained models over +a large benchmark on seven representative code tasks to assess +their reproducibility. We further compare the pre-trained models +and domain-specific state-of-the-art techniques for validating pre-trained effectiveness. At last, we investigate the robustness of the +pre-trained models by inspecting their performance variations under adversarial attacks. Through the study, we find that while we +can in general replicate the original performance of the pre-train +models on their evaluated tasks and adopted benchmarks, subtle +performance fluctuations can refute the findings in their original +papers. Moreover, none of the existing pre-trained models can dominate over all other models. We also find that the pre-trained models +can significantly outperform non-pre-trained state-of-the-art techniques in program understanding tasks. Furthermore, we perform +the first study for natural language-programming language pre-trained model robustness via adversarial attacks and find that a +simple random attack approach can easily fool the state-of-the-art +pre-trained models and thus incur security issues. At last, we also +provide multiple practical guidelines for advancing future research +on pre-trained models for program understanding and generation.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Learning Uniform Semantic Features for Natural Language and Programming Language Globally, Locally and Sequentially

+

Yudong Zhang, Wenhao Zheng, Ming Li. AAAI 2019

+

+ + + +
+ + representation + + bimodal + +

+

Semantic feature learning for natural language and programming language is a preliminary step in addressing many software mining tasks. Many existing methods leverage +information in lexicon and syntax to learn features for textual data. +However, such information is inadequate to represent the entire semantics in either text sentence or code snippet. This +motivates us to propose a new approach to learn semantic +features for both languages, through extracting three levels of +information, namely global, local and sequential information, +from textual data. For tasks involving both modalities, we +project the data of both types into a uniform feature space so +that the complementary knowledge in between can be utilized +in their representation. In this paper, we build a novel and +general-purpose feature learning framework called UniEmbed, to uniformly learn comprehensive semantic representation for both natural language and programming language. +Experimental results on three real-world software mining +tasks show that UniEmbed outperforms state-of-the-art models in feature learning and prove the capacity and effectiveness of our model.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Novel Neural Source Code Representation based on Abstract Syntax Tree

+

Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, Xudong Liu. ICSE 2019

+

+ + [PDF] + + + +
+ + representation + + grammar + +

+

Exploiting machine learning techniques for analyzing programs has attracted much attention. One key problem is how to represent code fragments well for follow-up analysis. Traditional information retrieval based methods often treat programs as natural language texts, which could miss important semantic information of source code. Recently, state-of-the-art studies demonstrate that abstract syntax tree (AST) based neural models can better represent source code. However, the sizes of ASTs are usually large and the existing models are prone to the long-term dependency problem. In this paper, we propose a novel AST-based Neural Network (ASTNN) for source code representation. Unlike existing models that work on entire ASTs, ASTNN splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements. Based on the sequence of statement vectors, a bidirectional RNN model is used to leverage the naturalness of statements and finally produce the vector representation of a code fragment. We have applied our neural network based source code representation method to two common program comprehension tasks: source code classification and code clone detection. Experimental results on the two tasks indicate that our model is superior to state-of-the-art approaches.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Generating Adversarial Examples for Holding Robustness of Source Code Processing Models

+

Huangzhao Zhang, Zhuo Li, Ge Li, Lei Ma, Yang Liu, Zhi Jin. AAAI 2020

+

+ + [Proceedings] + + + +
+ + adversarial + +

+

Automated processing, analysis, and generation of source code are among the key activities +in software and system life-cycle. To this end, while deep learning (DL) exhibits a certain level +of capability in handling these tasks, the current state-of-the-art DL models still suffer from +non-robust issues and can be easily fooled by adversarial attacks.

+ +

Different from adversarial +attacks for image, audio, andnatural languages, the structured nature of programming +languages brings new challenges. In this paper, we propose a Metropolis-Hastings +sampling-based identifier renaming technique, named Metropolis-Hastings Modifier (MHM), +which generates adversarial examples for DL models specialized for source code processing. +Our in-depth evaluation on a functionality classification benchmark demonstrates the +effectiveness of MHM in generating adversarial examples of source code. The higher robustness +and performance enhanced through our adversarial training with MHM further confirms the usefulness +of DL models-based method for future fully automated source code processing.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Bag-of-Words Baselines for Semantic Code Search

+

Xinyu Zhang, Ji Xin, Andrew Yates, Jimmy Lin. NLP4Prog 2021

+

+ + [PDF] + + + +
+ + search + +

+

The task of semantic code search is to retrieve code snippets from a source code corpus based on an information need expressed in natural language. The semantic gap between natural language and programming languages has for long been regarded as one of the most significant obstacles to the effectiveness of keyword-based information retrieval (IR) methods. It is a common assumption that “traditional” bag-of-words IR methods are poorly suited for semantic code search: our work empirically investigates this assumption. Specifically, we examine the effectiveness of two traditional IR methods, namely BM25 and RM3, on the CodeSearchNet Corpus, which consists of natural language queries paired with relevant code snippets. We find that the two keyword-based methods outperform several pre-BERT neural models. We also compare several code-specific data pre-processing strategies and find that specialized tokenization improves effectiveness.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Disentangled Code Representation Learning for Multiple Programming Languages

+

Jingfeng Zhang, Haiwen Hong, Yin Zhang, Yao Wan, Ye Liu, Yulei Sui. ACL 2021

+

+ + [Proceedings] + + + +
+ + representation + +

+

Developing effective distributed representations of source code is fundamental yet challenging for many software engineering tasks such as code clone detection, code search, code translation and transformation. However, current code embedding approaches that represent the semantic and syntax of code in a mixed way are less interpretable and the resulting embedding can not be easily generalized across programming languages. In this paper, we propose a disentangled code representation learning approach to separate the semantic from the syntax of source code under a multi-programming-language setting, obtaining better interpretability and generalizability. Specially, we design three losses dedicated to the characteristics of source code to enforce the disentanglement effectively. We conduct comprehensive experiments on a real-world dataset composed of programming exercises implemented by multiple solutions that are semantically identical but grammatically distinguished. The experimental results validate the superiority of our proposed disentangled code representation, compared to several baselines, across three types of downstream tasks, i.e., code clone detection, code translation, and code-to-code search.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CoditT5: Pretraining for Source Code and Natural Language Editing

+

Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, Milos Gligoric. 2022

+

+ + [ArXiV] + + + +
+ + Transformer + + edit + +

+

Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming pure generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a pure generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

+

Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, Weizhu Chen. 2023

+

+ + [ArXiV] + + [Code] + + + +
+ + completion + + Transformer + + retrieval + +

+

The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model, which allows for the effective utilization of repository-level information for code completion and grants the ability to generate code at various levels of granularity. Furthermore, RepoCoder utilizes a novel iterative retrieval-generation paradigm that bridges the gap between retrieval context and the intended completion target. We also propose a new benchmark RepoEval, which consists of the latest and high-quality real-world repositories covering line, API invocation, and function body completion scenarios. We test the performance of RepoCoder by using various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural-Augumented Static Analysis of Android Communication

+

Jinman Zhao, Aws Albarghouthi, Vaibhav Rastogi, Somesh Jha, Damien Octeau. FSE 2018

+

+ + [ArXiV] + + + +
+ + program analysis + +

+

We address the problem of discovering communication links between applications in the popular Android mobile operating system, an important problem for security and privacy in Android. Any scalable static analysis in this complex setting is bound to produce an excessive amount of false-positives, rendering it impractical. To improve precision, we propose to augment static analysis with a trained neural-network model that estimates the probability that a communication link truly exists. We describe a neural-network architecture that encodes abstractions of communicating objects in two applications and estimates the probability with which a link indeed exists. At the heart of our architecture are type-directed encoders (TDE), a general framework for elegantly constructing encoders of a compound data type by recursively composing encoders for its constituent types. We evaluate our approach on a large corpus of Android applications, and demonstrate that it achieves very high accuracy. Further, we conduct thorough interpretability studies to understand the internals of the learned neural networks.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Neural Networks for Modeling Source Code Edits

+

Rui Zhao, David Bieber, Kevin Swersky, Daniel Tarlow. 2019

+

+ + [OpenReview] + + [ArXiV] + + + +
+ + edit + +

+

Programming languages are emerging as a challenging and interesting domain for machine learning. A core task, which has received significant attention in recent years, is building generative models of source code. However, to our knowledge, previous generative models have always been framed in terms of generating static snapshots of code. In this work, we instead treat source code as a dynamic object and tackle the problem of modeling the edits that software developers make to source code files. This requires extracting intent from previous edits and leveraging it to generate subsequent edits. We develop several neural networks and use synthetic data to test their ability to learn challenging edit patterns that require strong generalization. We then collect and train our models on a large-scale dataset of Google source code, consisting of millions of fine-grained edits from thousands of Python developers. From the modeling perspective, our main conclusion is that a new composition of attentional and pointer network components provides the best overall performance and scalability. From the application perspective, our results provide preliminary evidence of the feasibility of developing tools that learn to predict future edits.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Generating Regular Expressions from Natural Language Specifications: Are We There Yet?

+

Zexuan Zhong, Jiaqi Guo, Wei Yang, Tao Xie, Jian-Guang Lou, Ting Liu, Dongmei Zhang. NLSE 2018

+

+ + [PDF] + + + +
+ + bimodal + + code generation + +

+

Recent state-of-the-art approaches automatically generate +regular expressions from natural language specifications. +Given that these approaches use only synthetic data in both +training datasets and validation/test datasets, a natural question arises: are these approaches effective to address various +real-world situations? To explore this question, in this paper, we conduct a characteristic study on comparing two synthetic datasets used by the recent research and a real-world +dataset collected from the Internet, and conduct an experimental study on applying a state-of-the-art approach on the +real-world dataset. Our study results suggest the existence of +distinct characteristics between the synthetic datasets and the +real-world dataset, and the state-of-the-art approach (based +on a model trained from a synthetic dataset) achieves extremely low effectiveness when evaluated on real-world data, +much lower than the effectiveness when evaluated on the synthetic dataset. We also provide initial analysis on some of +those challenging cases and discuss future directions.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Semantic Scaffolds for Pseudocode-to-Code Generation

+

Ruiqi Zhong, Mitchell Stern, Dan Klein. 2020

+

+ + [ArXiV] + + + +
+ + code generation + + synthesis + +

+

We propose a method for program generation based on semantic scaffolds, lightweight structures representing the high-level semantic and syntactic composition of a program. By first searching over plausible scaffolds then using these as constraints for a beam search over programs, we achieve better coverage of the search space when compared with existing techniques. We apply our hierarchical search method to the SPoC dataset for pseudocode-to-code generation, in which we are given line-level natural language pseudocode annotations and aim to produce a program satisfying execution-based test cases. By using semantic scaffolds during inference, we achieve a 10% absolute improvement in top-100 accuracy over the previous state-of-the-art. Additionally, we require only 11 candidates to reach the top-3000 performance of the previous best approach when tested against unseen problems, demonstrating a substantial improvement in efficiency.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks

+

Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, Yang Liu. NeurIPS 2020

+

+ + [Paper] + + + +
+ + GNN + + static analysis + +

+

Vulnerability identification is crucial to protect the software systems from attacks for cyber security. It is especially important to localize the vulnerable functions among the source code to facilitate the fix. However, it is a challenging and tedious process, and also requires specialized security expertise. Inspired by the work on manually-defined patterns of vulnerabilities from various code representation graphs and the recent advance on graph neural networks, we propose Devign, a general graph neural network based model for graph-level classification through learning on a rich set of code semantic representations. It includes a novel Conv module to efficiently extract useful features in the learned rich node representations for graph-level classification. The model is trained over manually labeled datasets built on 4 diversified large-scale open-source C projects that incorporate high complexity and variety of real source code instead of synthesis code used in previous works. The results of the extensive evaluation on the datasets demonstrate that Devign outperforms the state of the arts significantly with an average of 10.51% higher accuracy and 8.68% F1 score, increases averagely 4.66% accuracy and 6.37% F1 by the Conv module.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Improving Code Autocompletion with Transfer Learning

+

Wen Zhou, Seohyun Kim, Vijayaraghavan Murali, Gareth Ari Aye. 2021

+

+ + [ArXiV] + + + +
+ + autocomplete + + Transformer + +

+

Software language models have achieved promising results predicting code completion usages, and several industry studies have described successful IDE integrations. Recently, accuracy in autocompletion prediction improved 12.8% from training on a real-world dataset collected from programmers’ IDE activity. But what if limited examples of IDE autocompletion in the target programming language are available for model training? In this paper, we investigate the efficacy of pretraining autocompletion models on non-IDE, non-autocompletion, and different-language example code sequences. We find that these unsupervised pretrainings improve model accuracy by over 50% on very small fine-tuning datasets and over 10% on 50k labeled examples. We confirm the real-world impact of these pretrainings in an online setting through A/B testing on thousands of IDE autocompletion users, finding that pretraining is responsible for increases of up to 6.63% autocompletion usage.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

+

Shuyan Zhou, Uri Alon, Sumit Agarwal, Graham Neubig. 2023

+

+ + [ArXiV] + + [Code] + + + +
+ + evaluation + + Transformer + +

+

Since the rise of neural models of code that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching as BLEU, CodeBERTScore computes a soft similarity score between each token in the generated code and in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score . Our language-specific models have been downloaded more than 25,000 times from the Huggingface Hub.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

DocCoder: Generating Code by Retrieving and Reading Docs

+

Shuyan Zhou, Uri Alon, Frank F. Xu, Zhengbao JIang, Graham Neubig. 2022

+

+ + [ArXiV] + + [Code and Data] + + + +
+ + Transformer + + search + + code generation + +

+

Natural-language-to-code models learn to generate a code snippet given a natural language (NL) intent. However, the rapid growth of both publicly available and proprietary libraries and functions makes it impossible to cover all APIs using training examples, as new libraries and functions are introduced daily. Thus, existing models inherently cannot generalize to using unseen functions and libraries merely through incorporating them into the training data. In contrast, when human programmers write programs, they frequently refer to textual resources such as code manuals, documentation, and tutorials, to explore and understand available library functionality. Inspired by this observation, we introduce DocCoder: an approach that explicitly leverages code manuals and documentation by (1) retrieving the relevant documentation given the NL intent, and (2) generating the code based on the NL intent and the retrieved documentation. Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model. We demonstrate that DocCoder consistently improves NL-to-code models: DocCoder achieves 11x higher exact match accuracy than strong baselines on a new Bash dataset tldr; on the popular Python CoNaLa benchmark, DocCoder improves over strong baselines by 1.65 BLEU.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

OCoR: An Overlapping-Aware Code Retriever

+

Qihao Zhu, Zeyu Sun, Xiran Liang, Yingfei Xiong, Lu Zhang. ASE 2020

+

+ + [ArXiV] + + + +
+ + search + +

+

Code retrieval helps developers reuse the code snippet in the open-source projects. Given a natural language description, code retrieval aims to search for the most relevant code among a set of code. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. The overlaps between different names used by different people indicate that two different names may be potentially related (e.g., “message” and “msg”), and the overlaps between identifiers in code and words in natural language descriptions indicate that the code snippet and the description may potentially be related. To address these problems, we propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps: the first embeds identifiers by character to capture the overlaps between identifiers, and the second introduces a novel overlap matrix to represent the degrees of overlaps between each natural language word and each identifier. +The evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches and achieves 13.1% to 22.3% improvements. Moreover, we also conducted several in-depth experiments to help understand the performance of different components in OCoR.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

A Syntax-Guided Edit Decoder for Neural Program Repair

+

Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, Lu Zhang. FSE 2021

+

+ + [ArXiV] + + + +
+ + edit + +

+

Automated Program Repair (APR) helps improve the efficiency of software development and maintenance. Recent APR techniques use deep learning, particularly the encoder-decoder architecture, to generate patches. +Though existing DL-based APR approaches have proposed different encoder architectures, the decoder remains to be the standard one, which generates a sequence of tokens one by one to replace the faulty statement. +This decoder has multiple limitations: 1) allowing to generate syntactically incorrect programs, 2) inefficiently representing small edits, and 3) not being able to generate project-specific identifiers. +In this paper, we propose Recoder, a syntax-guided edit decoder with placeholder generation. Recoder is novel in multiple aspects: 1) Recoder generates edits rather than modified code, allowing efficient representation of small edits; 2) Recoder is syntax-guided, with the novel provider/decider architecture to ensure the syntactic correctness of the patched program and accurate generation; 3) Recoder generates placeholders that could be instantiated as project-specific identifiers later. +We conduct experiments to evaluate Recoder on 395 bugs from Defects4J v1.2, 420 additional bugs from Defects4J v2.0, 297 bugs from IntroClassJava and 40 bugs from QuixBugs. Our results show that Recoder repairs 53 bugs on Defects4J v1.2, which achieves 26.2% (11 bugs) improvement over the previous state-of-the-art approach for single-hunk bugs (TBar). Importantly, to our knowledge, Recoder is the first DL-based APR approach that has outperformed the traditional APR approaches on this benchmark.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Productivity Assessment of Neural Code Completion

+

Albert Ziegler, Eirini Kalliamvakou, Shawn Simister, Ganesh Sittampalam, Alice Li, Andrew Rice, Devon Rifkin, Edward Aftandilian. MAPS 2022

+

+ + [ArXiV] + + [Data] + + + +
+ + evaluation + + human evaluation + +

+

Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers’ productivity, without being able to measure it directly. In this case study, we asked users of GitHub Copilot about its impact on their productivity, and sought to find a reflection of their perception in directly measurable user data. We find that the rate with which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers’ perception of productivity.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Exploring and Evaluating Personalized Models for Code Generation

+

Andrei Zlotchevski, Dawn Drain, Alexey Svyatkovskiy, Colin Clement, Neel Sundaresan, Michele Tufano. FSE 2022

+

+ + [ArXiV] + + + +
+ + Transformer + +

+

Large Transformer models achieved the state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain – for example, question-answering on a given topic – generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model’s parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Language-Agnostic Representation Learning of Source Code from Structure and Context

+

Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, Stephan Günnemann. ICLR 2021

+

+ + [ArXiV] + + + +
+ + Transformer + + representation + +

+

Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.

+

+ +

Similar Work

+

+ + +

+ + + +

+ +

+

Presentations and Relevant Introductory Material

+ +

Tutorial: An Introduction to Learning from Programs by Marc Brockschmidt in VMCAI Winter School 2019 [slides].
Tutorial: Modelling Natural Language, Programs, and their Intersection in NAACL HLT 2018, 1 June 2018, New Orleans, LA, USA [slides] [video]

+ +

Datasets

+

Some resources about Big Code and Naturalness can be found at learnbigcode.github.io. +A list of datasets used in this area can be found at the appendix of the +survey and at learnbigcode.github.io.

+ +

Courses

+

A few university courses are been taught covering aspects of machine learning for code, big code or naturalness of code. Below there are a few that have publicly available material.

+

Analyzing Software using Deep Learning in University of Stuttgart [videos]
Seminars on Applications of Deep Learning in Software Engineering and Programming Languages in U.C. Berkeley
Machine learning for programming in the University of Cambridge, UK
Deep Learning for Symbolic Reasoning in Purdue University
Machine Learning for Software Engineering in TU Delft

+ +

Please, feel free to submit a pull request to adding more links in this page.

+ +

Workshops and Other Academic Events

+

The last few years a few workshops have been organized in this area. Please, feel free to add any missing or future workshops here.

+ +

Deep Learning for Code April 29 2022, ICLR 2022, virtual
NLP4Prog Workshop 6 August 2021, ACL 2021, virtual
Workshop on Computer-Assisted Programming 12 December 2020, NeurIPS 2020, virtual
ML on Code devroom at FOSDEM19 2-3 February 2019, Brussels, EU [videos]
Machine Learning for Programming 18–19 July 2018, Oxford, UK [videos]
International Workshop on Machine Learning techniques for Programming Languages 16 - 21 July 2018 Amsterdam, Netherlands
Workshop on Machine Learning and Programming Languages in PLDI 18 - 22 June 2018, Philadelphia, PA, USA
Workshop on NLP for Software Engineering 4 February 2018, New Orleans, LA, USA
The 55th CREST Open Workshop - Bimodal Program Analysis 30-31 October 2017, London, UK
Workshop on NLP for Software Engineering 13 November 2016, Seattle, WA, USA
Programming with “Big Code” 15-18 November 2015, Dagstuhl, Germany

+ +

Courses on Important Relevant Background

+ +

Sofware Analysis at Univ. of Pennsylvania. It is a great introduction to Program Analysis [videos]
Program Analysis at University of Stuttgart [videos]
Applications of Data Science for Software Engineering 2020 at Eindhoven University of Technology.

+ +

Competitions

+

nlc2cmd in NeurIPS 2020 by Project CLAI. Starts July 2020.
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search by Github. Starts Sep 2019.
CodRep 2019: Machine Learning on Source Code Competition by KTH. Starts on April 25th 2019.
CodRep 2018: Machine Learning on Source Code Competition by KTH. Starts on April 14th 2018.

+ + +

source{d} has collected a set of links and +papers in the area. You can access the list here.
Autormated Program Repair +has a curated list of pointers for helping newcomers to understan the field, +maintained by Martin Monperrus.

+ +

+ + +

Publications by Tag

The following tags appear in the publications listed in the review:

-{% for tag in rawtags %}{{ tag }} {% endfor %} +adversarial API autocomplete benchmark benchmarking bimodal Binary Code clone code completion code generation code similarity compilation completion cybersecurity dataset decompilation defect deobfuscation documentation dynamic edit editing education evaluation execution feature location fuzzing generalizability generation GNN grammar human evaluation information extraction instruction tuning interpretability language model large language models LLM logging memorization metrics migration naming natural language generation natural language processing notebook optimization pattern mining plagiarism detection pretraining program analysis program synthesis question answering refactoring repair representation retrieval Reverse Engineering review search static static analysis style summarization survey synthesis test generation tool topic modeling topic modelling traceability Transformer Transformers translation types variable misuse verification vulnerability

Topic-based Explorer

Using topic-modelling the following topics have been extracted. The top stemmed words apprear below. Please change the slider to present the papers that mostly related to the appropria topics

@@ -58,4 +153,8 @@

Topic-based Explorer

$("#toppapers").append("

"+ data.title +". " + data.year + "

"); } } - \ No newline at end of file + +

+ + + diff --git a/topics.json b/topics.json new file mode 100644 index 00000000..75d3b9fd --- /dev/null +++ b/topics.json @@ -0,0 +1 @@ +{"topics": [["dataset", "performance"], ["neural", "search", "network", "query"], ["method", "test", "case"], ["representation", "semantic", "clone", "feature"], ["task", "generation", "transformer", "programming"], ["bug", "analysis", "program", "vulnerability"], ["software", "text", "library", "statement"], ["api", "error", "function", "vector"], ["pattern", "semantic", "new"], ["type", "static", "inference"], ["completion", "context", "comment", "suggestion"], ["change", "message", "generation", "automatically"], ["program", "graph", "neural", "representation"], ["name", "method", "variable", "naming"], ["translation", "machine", "python", "method"], ["repair", "bug", "fix", "program"], ["class", "embeddings"], ["program", "deep", "optimization", "compiler"], ["program", "system", "programming", "synthesis"], ["natural", "summarization", "task", "description"]], "paper_data": [{"key": "abdelaziz2020graph4code", "year": "2020", "title": "Graph4Code: A Machine Interpretable Knowledge Graph for Code", "topic_distr": {"0": 0.07088343799114227, "1": 0.0012124303029850125, "2": 0.0010248323669657111, "3": 0.0008876280626282096, "4": 0.0007828403613530099, "5": 0.0007001928170211613, "6": 0.0006333300843834877, "7": 0.000578124076128006, "8": 0.0005317708128131926, "9": 0.2450011521577835, "10": 0.00045828186557628214, "11": 0.0004286621115170419, "12": 0.00040263860137201846, "13": 0.00037959395558573306, "14": 0.00035904443939216435, "15": 0.00034060553298331797, "16": 0.6745088696479797, "17": 0.00030888020410202444, "18": 0.0002951351925730705, "19": 0.0002825613191816956}}, {"key": "agashe2019julce", "year": "2019", "title": "JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation", "topic_distr": {"0": 0.2759517729282379, "1": 0.0018515449482947588, "2": 0.09552232921123505, "3": 0.0013556184712797403, "4": 0.3181428611278534, "5": 0.001069381134584546, "6": 0.0009672642336227, "7": 0.09237674623727798, "8": 0.0008121561259031296, "9": 0.0007518719648942351, "10": 0.0006999188335612416, "11": 0.0006546815275214612, "12": 0.0006149366963654757, "13": 0.0005797413759864867, "14": 0.0005483567947521806, "15": 0.0005201956373639405, "16": 0.0004947857814840972, "17": 0.0004717425908893347, "18": 0.0004507502890191972, "19": 0.20616331696510315}}, {"key": "aggarwal2015using", "year": "2015", "title": "Using Machine Translation for Converting Python 2 to Python 3 Code", "topic_distr": {"0": 0.003198413411155343, "1": 0.002610740251839161, "2": 0.002207281067967415, "3": 0.0019117265474051237, "4": 0.0016860729083418846, "5": 0.001508068758994341, "6": 0.001364060677587986, "7": 0.001245158608071506, "8": 0.001145323272794485, "9": 0.0010603091213852167, "10": 0.0009870434878394008, "11": 0.0009232487063854933, "12": 0.0008671995019540191, "13": 0.0008175661787390709, "14": 0.9751269221305847, "15": 0.0007335932459682226, "16": 0.0006977595621719956, "17": 0.0006652635056525469, "18": 0.0006356596131809056, "19": 0.0006085780914872885}}, {"key": "agrawal2023monitor", "year": "2023", "title": "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context", "topic_distr": {"0": 0.3650084435939789, "1": 0.001272849040105939, "2": 0.0010760932927951217, "3": 0.0009320022654719651, "4": 0.0008219893788918853, "5": 0.0007352068205364048, "6": 0.0006650005816482008, "7": 0.11699128895998001, "8": 0.0005583626916632056, "9": 0.297407329082489, "10": 0.21125425398349762, "11": 0.0004500978975556791, "12": 0.0004227730678394437, "13": 0.0003985760558862239, "14": 0.0003769989125430584, "15": 0.00035763796768151224, "16": 0.0003401684807613492, "17": 0.0003243261598981917, "18": 0.0003098938032053411, "19": 0.0002966911706607789}}, {"key": "ahmad2020transformer", "year": "2020", "title": "A Transformer-based Approach for Source Code Summarization", "topic_distr": {"0": 0.418683260679245, "1": 0.002167036524042487, "2": 0.0018316010246053338, "3": 0.00158640556037426, "4": 0.0013991354499012232, "5": 0.0012514236150309443, "6": 0.0011319229379296303, "7": 0.001033255597576499, "8": 0.0009504104964435101, "9": 0.0008798640919849277, "10": 0.2022024542093277, "11": 0.0007661288254894316, "12": 0.19639898836612701, "13": 0.000678431533742696, "14": 0.0006417042459361255, "15": 0.000608749280218035, "16": 0.0005790137802250683, "17": 0.0005520479753613472, "18": 0.0005274820723570883, "19": 0.1661306917667389}}, {"key": "ahmad2021unified", "year": "2021", "title": "Unified Pre-training for Program Understanding and Generation", "topic_distr": {"0": 0.002113898051902652, "1": 0.0017260139575228095, "2": 0.0014590692007914186, "3": 0.13174130022525787, "4": 0.4038344621658325, "5": 0.0009969058446586132, "6": 0.0009017095435410738, "7": 0.0008231094689108431, "8": 0.0007571136229671538, "9": 0.0007009151158854365, "10": 0.0006524830823764205, "11": 0.0006103116320446134, "12": 0.1600915789604187, "13": 0.0005404503899626434, "14": 0.12556742131710052, "15": 0.05255412682890892, "16": 0.00046125249355100095, "17": 0.0004397710436023772, "18": 0.11362578719854355, "19": 0.0004022992798127234}}, {"key": "ahmed2019learning", "year": "2019", "title": "Learning Lenient Parsing & Typing via Indirect Supervision", "topic_distr": {"0": 0.3593815267086029, "1": 0.0011705803917720914, "2": 0.05253699794411659, "3": 0.0008570475620217621, "4": 0.0007558610523119569, "5": 0.0006760613759979606, "6": 0.0006115027354098856, "7": 0.1184670478105545, "8": 0.0005134436069056392, "9": 0.30700382590293884, "10": 0.0004424874496180564, "11": 0.00041388848330825567, "12": 0.00038876186590641737, "13": 0.00036651146365329623, "14": 0.0003466701600700617, "15": 0.1548989862203598, "16": 0.0003128026728518307, "17": 0.00029823483782820404, "18": 0.0002849635202437639, "19": 0.00027282300288788974}}, {"key": "ahmed2022learning", "year": "2022", "title": "Learning code summarization from a small and local dataset", "topic_distr": {"0": 0.9846402406692505, "1": 0.0017865687841549516, "2": 0.0015102287288755178, "3": 0.0013080688659101725, "4": 0.0011536659440025687, "5": 0.0010318682761862874, "6": 0.0009333334746770561, "7": 0.0008519768016412854, "8": 0.000783666386269033, "9": 0.0007254969095811248, "10": 0.0006753663183189929, "11": 0.0006317158695310354, "12": 0.0005933652864769101, "13": 0.0005594045505858958, "14": 0.000529120909050107, "15": 0.0005019476520828903, "16": 0.0004774291010107845, "17": 0.0004551942693069577, "18": 0.00043493835255503654, "19": 0.00041640832205303013}}, {"key": "ahmed2024studying", "year": "2024", "title": "Studying LLM Performance on Closed- and Open-source Data", "topic_distr": {"0": 0.7309610843658447, "1": 0.17183436453342438, "2": 0.0015372595516964793, "3": 0.043363429605960846, "4": 0.0011742698261514306, "5": 0.0010502971708774567, "6": 0.0009500025771558285, "7": 0.0008671928662806749, "8": 0.0007976624765433371, "9": 0.0007384541095234454, "10": 0.0006874282262288034, "11": 0.04199986159801483, "12": 0.0006039626314304769, "13": 0.0005693954299204051, "14": 0.0005385708645917475, "15": 0.0005109123303554952, "16": 0.00048595588305033743, "17": 0.000463323958683759, "18": 0.00044270625221543014, "19": 0.0004238452820573002}}, {"key": "ahmed2033improving", "year": "2023", "title": "Improving Few-Shot Prompts with Relevant Static Analysis Products", "topic_distr": {"0": 0.5228775143623352, "1": 0.0009518204024061561, "2": 0.0008045544964261353, "3": 0.0006968331290408969, "4": 0.0006145767401903868, "5": 0.0005496931844390929, "6": 0.0004972020396962762, "7": 0.0004538619832601398, "8": 0.00041747192153707147, "9": 0.14876846969127655, "10": 0.0003597786999307573, "11": 0.00033652540878392756, "12": 0.0974959135055542, "13": 0.0002980039862450212, "14": 0.019535601139068604, "15": 0.00026739572058431804, "16": 0.05971526727080345, "17": 0.00024248944828286767, "18": 0.00023169878113549203, "19": 0.1448853462934494}}, {"key": "alet2021largescale", "year": "2021", "title": "A large-scale benchmark for few-shot program induction and synthesis", "topic_distr": {"0": 0.49097418785095215, "1": 0.1295812577009201, "2": 0.08591875433921814, "3": 0.0010651455959305167, "4": 0.0009394112275913358, "5": 0.000840232998598367, "6": 0.0007599978125654161, "7": 0.0006937503931112587, "8": 0.082987479865551, "9": 0.0005907600279897451, "10": 0.0005499394610524178, "11": 0.0005143956514075398, "12": 0.00048316738684661686, "13": 0.00045551377115771174, "14": 0.000430854270234704, "15": 0.000408727559261024, "16": 0.0003887625352945179, "17": 0.0003706570714712143, "18": 0.20170797407627106, "19": 0.0003390743222553283}}, {"key": "allal2022santacoder", "year": "2022", "title": "SantaCoder: don\u2019t reach for the stars!", "topic_distr": {"0": 0.8992711901664734, "1": 0.0024833122733980417, "2": 0.0020996439270675182, "3": 0.0018184887012466788, "4": 0.0016038415487855673, "5": 0.0014345147646963596, "6": 0.0012975306017324328, "7": 0.0011844276450574398, "8": 0.001089461729861796, "9": 0.08038440346717834, "10": 0.0009389017941430211, "11": 0.000878218503203243, "12": 0.0008249030215665698, "13": 0.0007776904967613518, "14": 0.0007355897687375546, "15": 0.0006978132296353579, "16": 0.0006637272890657187, "17": 0.0006328161689452827, "18": 0.0006046561757102609, "19": 0.0005788955022580922}}, {"key": "allamanis2013mining", "year": "2013", "title": "Mining Source Code Repositories at Massive Scale Using Language Modeling ", "topic_distr": {"0": 0.4560050368309021, "1": 0.0015912364469841123, "2": 0.0013450579717755318, "3": 0.001164990826509893, "4": 0.0010274782544001937, "5": 0.0009190035634674132, "6": 0.29781877994537354, "7": 0.0007587882573716342, "8": 0.0006979495519772172, "9": 0.0006461426382884383, "10": 0.0006014952668920159, "11": 0.0005626193014904857, "12": 0.0005284634535200894, "13": 0.23382586240768433, "14": 0.0004712460795417428, "15": 0.00044704502215608954, "16": 0.00042520827264524996, "17": 0.00040540550253354013, "18": 0.00038736514397896826, "19": 0.0003708619042299688}}, {"key": "allamanis2014learning", "year": "2014", "title": "Learning Natural Coding Conventions", "topic_distr": {"0": 0.0020448907744139433, "1": 0.0016694333171471953, "2": 0.0014111896743997931, "3": 0.0012222951045259833, "4": 0.0010780119337141514, "5": 0.0009642018703743815, "6": 0.0008721285848878324, "7": 0.0007961070514284074, "8": 0.0007322761812247336, "9": 0.0006779212853871286, "10": 0.7658169865608215, "11": 0.0005902901175431907, "12": 0.0005544543964788318, "13": 0.2189393788576126, "14": 0.0004944229149259627, "15": 0.00046903162728995085, "16": 0.00044612091733142734, "17": 0.0004253441875334829, "18": 0.00040641656960360706, "19": 0.0003891016822308302}}, {"key": "allamanis2014mining", "year": "2014", "title": "Mining Idioms from Source Code", "topic_distr": {"0": 0.0018893310334533453, "1": 0.001543002319522202, "2": 0.0013042653445154428, "3": 0.001129685784690082, "4": 0.0009963358752429485, "5": 0.0008911492768675089, "6": 0.8328647613525391, "7": 0.0007357901195064187, "8": 0.15346352756023407, "9": 0.0006265587289817631, "10": 0.000583264569286257, "11": 0.0005455669015645981, "12": 0.0005124462768435478, "13": 0.00048311689170077443, "14": 0.00045696308370679617, "15": 0.00043349553016014397, "16": 0.00041232065996155143, "17": 0.00039311806904152036, "18": 0.00037562448414973915, "19": 0.00035962148103863}}, {"key": "allamanis2015bimodal", "year": "2015", "title": "A Bimodal Modelling of Source Code and Natural Language", "topic_distr": {"0": 0.0029692112002521753, "1": 0.13666760921478271, "2": 0.0020496135111898184, "3": 0.001775212469510734, "4": 0.0015656740870326757, "5": 0.0014003795804455876, "6": 0.19510884582996368, "7": 0.0011562436120584607, "8": 0.0010635374346747994, "9": 0.0009845939930528402, "10": 0.0009165601804852486, "11": 0.0008573208469897509, "12": 0.06825180351734161, "13": 0.0007591849425807595, "14": 0.0007180860848166049, "15": 0.0006812084466218948, "16": 0.0006479335715994239, "17": 0.0006177580216899514, "18": 0.0005902680568397045, "19": 0.581218957901001}}, {"key": "allamanis2015suggesting", "year": "2015", "title": "Suggesting Accurate Method and Class Names", "topic_distr": {"0": 0.0014520895201712847, "1": 0.0011842730455100536, "2": 0.04267728701233864, "3": 0.07110925763845444, "4": 0.05749662220478058, "5": 0.0006839247071184218, "6": 0.0006186153623275459, "7": 0.0005646919598802924, "8": 0.16111987829208374, "9": 0.0004808608500752598, "10": 0.0004476341709960252, "11": 0.0004187026061117649, "12": 0.0003932837280444801, "13": 0.4074627161026001, "14": 0.00035070240846835077, "15": 0.00033269193954765797, "16": 0.25234082341194153, "17": 0.00030170369427651167, "18": 0.0002882780390791595, "19": 0.00027599630993790925}}, {"key": "allamanis2016convolutional", "year": "2016", "title": "A Convolutional Attention Network for Extreme Summarization of Source Code", "topic_distr": {"0": 0.0018614925211295485, "1": 0.7520753145217896, "2": 0.0012848025653511286, "3": 0.0011128397891297936, "4": 0.0009814701043069363, "5": 0.0008778529590927064, "6": 0.0007940252544358373, "7": 0.0007248118054121733, "8": 0.0006666972767561674, "9": 0.0006172101711854339, "10": 0.0005745620001107454, "11": 0.0005374267348088324, "12": 0.0005048002931289375, "13": 0.000475908542284742, "14": 0.23496609926223755, "15": 0.0004270275530871004, "16": 0.00040616863407194614, "17": 0.00038725254125893116, "18": 0.0003700199886225164, "19": 0.00035425572423264384}}, {"key": "allamanis2017mining", "year": "2017", "title": "Mining Semantic Loop Idioms from Big Code", "topic_distr": {"0": 0.0017321386840194464, "1": 0.0014143368462100625, "2": 0.001195550081320107, "3": 0.0010355368722230196, "4": 0.0009133002604357898, "5": 0.0008168800268322229, "6": 0.000738874776288867, "7": 0.0006744686979800463, "8": 0.9867287278175354, "9": 0.0005743406945839524, "10": 0.0005346547113731503, "11": 0.0005000988021492958, "12": 0.00046973847202025354, "13": 0.00044285343028604984, "14": 0.0004188793245702982, "15": 0.00039736757753416896, "16": 0.00037795741809532046, "17": 0.0003603552177082747, "18": 0.00034431956009939313, "19": 0.0003296502400189638}}, {"key": "allamanis2017smartpaste", "year": "2017", "title": "SmartPaste: Learning to Adapt Source Code", "topic_distr": {"0": 0.0020134099759161472, "1": 0.001642551738768816, "2": 0.0013884877553209662, "3": 0.17101162672042847, "4": 0.2323119342327118, "5": 0.0009486556518822908, "6": 0.0008580668945796788, "7": 0.0007832710980437696, "8": 0.03738534078001976, "9": 0.0006669908761978149, "10": 0.0006209029816091061, "11": 0.0005807726411148906, "12": 0.26542750000953674, "13": 0.18123076856136322, "14": 0.07625972479581833, "15": 0.025229869410395622, "16": 0.00043892793473787606, "17": 0.00041848619002848864, "18": 0.0003998637548647821, "19": 0.0003828280314337462}}, {"key": "allamanis2018learning", "year": "2018", "title": "Learning to Represent Programs with Graphs", "topic_distr": {"0": 0.0015211785212159157, "1": 0.0012420057319104671, "2": 0.0010498291812837124, "3": 0.09247496724128723, "4": 0.0008019358501769602, "5": 0.11273577064275742, "6": 0.000648778339382261, "7": 0.0005922256968915462, "8": 0.14567629992961884, "9": 0.0005043070996180177, "10": 0.00046946032671257854, "11": 0.0004391180700622499, "12": 0.33007749915122986, "13": 0.3098098635673523, "14": 0.0003678022767417133, "15": 0.000348913628840819, "16": 0.0003318703093100339, "17": 0.0003164144582115114, "18": 0.00030233414145186543, "19": 0.00028945357189513743}}, {"key": "allamanis2019adverse", "year": "2019", "title": "The Adverse Effects of Code Duplication in Machine Learning Models of Code", "topic_distr": {"0": 0.8360024690628052, "1": 0.0020367044489830732, "2": 0.001721672946587205, "3": 0.0014911944745108485, "4": 0.0013151675229892135, "5": 0.0011763201327994466, "6": 0.14755146205425262, "7": 0.0009712455212138593, "8": 0.0008933722856454551, "9": 0.0008270596736110747, "10": 0.0007699112757109106, "11": 0.0007201501866802573, "12": 0.0006764308200217783, "13": 0.0006377159734256566, "14": 0.0006031928351148963, "15": 0.0005722155910916626, "16": 0.000544264679774642, "17": 0.000518917222507298, "18": 0.0004958256613463163, "19": 0.0004747015773318708}}, {"key": "allamanis2020typilus", "year": "2020", "title": "Typilus: Neural Type Hints", "topic_distr": {"0": 0.0022670384496450424, "1": 0.0018517745193094015, "2": 0.0015650822315365076, "3": 0.0013556296471506357, "4": 0.001195593737065792, "5": 0.0010693712392821908, "6": 0.0009672553278505802, "7": 0.0008829417638480663, "8": 0.06742677837610245, "9": 0.7530001997947693, "10": 0.0006999124307185411, "11": 0.0006546755321323872, "12": 0.16356663405895233, "13": 0.0005797360208816826, "14": 0.0005483517306856811, "15": 0.0005201908643357456, "16": 0.0004947811830788851, "17": 0.0004717382544185966, "18": 0.0004507461271714419, "19": 0.0004315426340326667}}, {"key": "allamanis2021self", "year": "2021", "title": "Self-Supervised Bug Detection and Repair", "topic_distr": {"0": 0.0025481863413006067, "1": 0.0020782339852303267, "2": 0.18657051026821136, "3": 0.0015216409228742123, "4": 0.06740624457597733, "5": 0.3491215109825134, "6": 0.0010857207234948874, "7": 0.0009910807712003589, "8": 0.000911617127712816, "9": 0.0008439502562396228, "10": 0.0007856347365304828, "11": 0.0007348573999479413, "12": 0.03444504365324974, "13": 0.05472979694604874, "14": 0.0006155114970169961, "15": 0.2935352325439453, "16": 0.0005553798982873559, "17": 0.0005295147420838475, "18": 0.000505951582454145, "19": 0.00048439615056850016}}, {"key": "alon2018code2seq", "year": "2019", "title": "code2seq: Generating Sequences from Structured Representations of Code", "topic_distr": {"0": 0.0021526399068534374, "1": 0.001755962846800685, "2": 0.0014841962838545442, "3": 0.0012855327222496271, "4": 0.001133787794969976, "5": 0.0010140878148376942, "6": 0.0009172509307973087, "7": 0.000837296131066978, "8": 0.0007701627910137177, "9": 0.0007129957084544003, "10": 0.0006637288606725633, "11": 0.0006208306294865906, "12": 0.35666176676750183, "13": 0.0005497653037309647, "14": 0.20037437975406647, "15": 0.0004932984593324363, "16": 0.0004692023794632405, "17": 0.1454165130853653, "18": 0.0004274437960702926, "19": 0.28225913643836975}}, {"key": "alon2018general", "year": "2018", "title": "A General Path-Based Representation for Predicting Program Properties", "topic_distr": {"0": 0.001485423999838531, "1": 0.0012122917687520385, "2": 0.0010247883619740605, "3": 0.19171267747879028, "4": 0.0624798983335495, "5": 0.0007001911289989948, "6": 0.0006333286873996258, "7": 0.0005781227955594659, "8": 0.08659805357456207, "9": 0.2658021152019501, "10": 0.0004582808760460466, "11": 0.00042866115109063685, "12": 0.14678268134593964, "13": 0.07561329752206802, "14": 0.000359043653588742, "15": 0.00034060480538755655, "16": 0.0003239673387724906, "17": 0.00030887953471392393, "18": 0.16287513077259064, "19": 0.00028256067889742553}}, {"key": "alon2019code2vec", "year": "2019", "title": "code2vec: Learning Distributed Representations of Code", "topic_distr": {"0": 0.10722559690475464, "1": 0.0011843906249850988, "2": 0.1597634106874466, "3": 0.223061665892601, "4": 0.0007646462181583047, "5": 0.0006839190027676523, "6": 0.0006186104728840292, "7": 0.0005646874778904021, "8": 0.0005194115801714361, "9": 0.1479332149028778, "10": 0.00044763064943253994, "11": 0.00041869928827509284, "12": 0.20545095205307007, "13": 0.149497389793396, "14": 0.0003506996436044574, "15": 0.00033268932020291686, "16": 0.00031643849797546864, "17": 0.0003017013368662447, "18": 0.0002882757689803839, "19": 0.000275994127150625}}, {"key": "alon2019structural", "year": "2019", "title": "Structural Language Models for Any-Code Generation", "topic_distr": {"0": 0.002401187550276518, "1": 0.0019585811533033848, "2": 0.0016554603353142738, "3": 0.001433836529031396, "4": 0.13687549531459808, "5": 0.0011310765985399485, "6": 0.001023068092763424, "7": 0.0009338894160464406, "8": 0.11844903975725174, "9": 0.0007952492451295257, "10": 0.0007402988849207759, "11": 0.0006924517219886184, "12": 0.47989422082901, "13": 0.0006131880800239742, "14": 0.0005799928330816329, "15": 0.000550207041669637, "16": 0.0005233311676420271, "17": 0.0004989585722796619, "18": 0.0004767551727127284, "19": 0.2487737238407135}}, {"key": "amodio2017neural", "year": "2017", "title": "Neural Attribute Machines for Program Generation", "topic_distr": {"0": 0.0027100571896880865, "1": 0.002213959814980626, "2": 0.0018713506869971752, "3": 0.0016208769520744681, "4": 0.0014295323053374887, "5": 0.0012786091538146138, "6": 0.053966496139764786, "7": 0.0010557019850239158, "8": 0.0009710571612231433, "9": 0.0008989782072603703, "10": 0.0008368603303097188, "11": 0.07298249006271362, "12": 0.13704590499401093, "13": 0.000693169713485986, "14": 0.0006556446314789355, "15": 0.0006219737115316093, "16": 0.000591592222917825, "17": 0.7175008058547974, "18": 0.0005389410653151572, "19": 0.0005159801803529263}}, {"key": "arakelyan2020towards", "year": "2020", "title": "Towards Learning Representations of Binary Executable Files for Security Tasks", "topic_distr": {"0": 0.002711244160309434, "1": 0.0022139190696179867, "2": 0.001871404587291181, "3": 0.14903688430786133, "4": 0.0014295432483777404, "5": 0.19721196591854095, "6": 0.0011565189342945814, "7": 0.0010557076893746853, "8": 0.24858440458774567, "9": 0.0008989830384962261, "10": 0.0008368648705072701, "11": 0.000782776391133666, "12": 0.3880283832550049, "13": 0.0006931734969839454, "14": 0.0006556481239385903, "15": 0.0006219770293682814, "16": 0.0005915954243391752, "17": 0.0005640436429530382, "18": 0.0005389439756982028, "19": 0.0005159829161129892}}, {"key": "ashwath2020predicting", "year": "2020", "title": "Predicting Vulnerability in Large Codebases With Deep Code Representation", "topic_distr": {"0": 0.0014338825130835176, "1": 0.0011705431388691068, "2": 0.0009894476970657706, "3": 0.13747656345367432, "4": 0.0007558501674793661, "5": 0.47238895297050476, "6": 0.0006114950519986451, "7": 0.06722278147935867, "8": 0.17382670938968658, "9": 0.00047532611642964184, "10": 0.12090957909822464, "11": 0.0004138833028264344, "12": 0.0003887570055667311, "13": 0.00036650686524808407, "14": 0.00034666582359932363, "15": 0.020054273307323456, "16": 0.00031279874383471906, "17": 0.0002982310834340751, "18": 0.00028495994047261775, "19": 0.00027281956863589585}}, {"key": "aye2020learning", "year": "2020", "title": "Learning Autocompletion from Real-World Datasets", "topic_distr": {"0": 0.2502075433731079, "1": 0.0019215599168092012, "2": 0.001624296186491847, "3": 0.0014067665906623006, "4": 0.0012407161993905902, "5": 0.0011097287060692906, "6": 0.0010037586325779557, "7": 0.0009162631467916071, "8": 0.0008427983266301453, "9": 0.0007802397012710571, "10": 0.7339997291564941, "11": 0.0006793823558837175, "12": 0.0006381379789672792, "13": 0.0006016147672198713, "14": 0.0005690460093319416, "15": 0.000539822387509048, "16": 0.000513453793246299, "17": 0.0004895412130281329, "18": 0.00046775685041211545, "19": 0.0004478286427911371}}, {"key": "aye2020sequence", "year": "2020", "title": "Sequence Model Design for Code Completion in the Modern IDE", "topic_distr": {"0": 0.0012732821051031351, "1": 0.2965857684612274, "2": 0.0008784107631072402, "3": 0.04090014845132828, "4": 0.0006710082525387406, "5": 0.05650894716382027, "6": 0.0005428564618341625, "7": 0.0004955368349328637, "8": 0.0004558053333312273, "9": 0.00042197215952910483, "10": 0.5985909700393677, "11": 0.00036742610973306, "12": 0.00034512014826759696, "13": 0.0003253675240557641, "14": 0.0003077535948250443, "15": 0.00029194875969551504, "16": 0.00027768799918703735, "17": 0.00026475550839677453, "18": 0.0002529740158934146, "19": 0.00024219635815825313}}, {"key": "bai2021jointly", "year": "2021", "title": "Jointly Learning to Repair Code and Generate Commit Message", "topic_distr": {"0": 0.0020790842827409506, "1": 0.0016971237491816282, "2": 0.25303056836128235, "3": 0.0012426425237208605, "4": 0.19332844018936157, "5": 0.0009802572894841433, "6": 0.0008866509306244552, "7": 0.028752239421010017, "8": 0.0007444697548635304, "9": 0.0006892097881063819, "10": 0.0006415865500457585, "11": 0.26090288162231445, "12": 0.0005636869464069605, "13": 0.000531424826476723, "14": 0.0005026558646932244, "15": 0.25173234939575195, "16": 0.0004535495536401868, "17": 0.0004324268375057727, "18": 0.0004131840541958809, "19": 0.0003955808642785996}}, {"key": "barchi2019code", "year": "2019", "title": "Code Mapping in Heterogeneous Platforms Using Deep Learning and LLVM-IR", "topic_distr": {"0": 0.003465965623036027, "1": 0.002828799420967698, "2": 0.1782567799091339, "3": 0.002071104943752289, "4": 0.0018266349798068404, "5": 0.1474480926990509, "6": 0.0014777736505493522, "7": 0.09766323119401932, "8": 0.0012408014154061675, "9": 0.001148700132034719, "10": 0.0010693268850445747, "11": 0.0010002139024436474, "12": 0.2073763608932495, "13": 0.0008857212960720062, "14": 0.0008377723279409111, "15": 0.0007947481353767216, "16": 0.0007559271762147546, "17": 0.34850409626960754, "18": 0.0006886503542773426, "19": 0.0006593112484551966}}, {"key": "barchi2021exploration", "year": "2021", "title": "Exploration of Convolutional Neural Network models for source code classification", "topic_distr": {"0": 0.0023545543663203716, "1": 0.367083340883255, "2": 0.10670661181211472, "3": 0.0014067915035411716, "4": 0.0012407341273501515, "5": 0.0011097437236458063, "6": 0.0010037722531706095, "7": 0.0009162755450233817, "8": 0.0008428097353316844, "9": 0.0007802502368576825, "10": 0.000726336264051497, "11": 0.0006793915526941419, "12": 0.0006381465937010944, "13": 0.2678847908973694, "14": 0.0005690536927431822, "15": 0.0005398296634666622, "16": 0.0005134607199579477, "17": 0.24408850073814392, "18": 0.000467763195047155, "19": 0.0004478346963878721}}, {"key": "barchi2022deep", "year": "2022", "title": "Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities", "topic_distr": {"0": 0.686924159526825, "1": 0.002121787518262863, "2": 0.001793382689356804, "3": 0.0015533199766650796, "4": 0.0013699608389288187, "5": 0.0012253285385668278, "6": 0.001108319847844541, "7": 0.0010117099154740572, "8": 0.0009305923595093191, "9": 0.0008615169790573418, "10": 0.0008019875967875123, "11": 0.000750153383705765, "12": 0.0007046125829219818, "13": 0.0006642847438342869, "14": 0.0006283232942223549, "15": 0.0005960554699413478, "16": 0.0005669400561600924, "17": 0.14766386151313782, "18": 0.1482292115688324, "19": 0.0004944787942804396}}, {"key": "bareiss2022code", "year": "2022", "title": "Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code", "topic_distr": {"0": 0.0012724485713988543, "1": 0.0010390899842604995, "2": 0.0008784145466051996, "3": 0.000760803057346493, "4": 0.9904649257659912, "5": 0.0006001597503200173, "6": 0.0005428494187071919, "7": 0.0004955304320901632, "8": 0.0004557994252536446, "9": 0.0004219666589051485, "10": 0.0003928094811271876, "11": 0.0003674213367048651, "12": 0.0003451156662777066, "13": 0.00032536330400034785, "14": 0.00030774957849644125, "15": 0.0002919449470937252, "16": 0.0002776843903120607, "17": 0.00026475207414478064, "18": 0.000252970727160573, "19": 0.0002421932149445638}}, {"key": "barke2022grounded", "year": "2022", "title": "Grounded Copilot: How Programmers Interact with Code-Generating Models", "topic_distr": {"0": 0.29114681482315063, "1": 0.002679581753909588, "2": 0.002265258925035596, "3": 0.0019620859529823065, "4": 0.0017304903594776988, "5": 0.0015477921115234494, "6": 0.0013999911025166512, "7": 0.0012779568787664175, "8": 0.0011754919541999698, "9": 0.0010882383212447166, "10": 0.0010130428709089756, "11": 0.04968247935175896, "12": 0.0008900421671569347, "13": 0.0008391014416702092, "14": 0.0007936762413010001, "15": 0.0007529166177846491, "16": 0.0007161390385590494, "17": 0.0006827870383858681, "18": 0.6377314925193787, "19": 0.0006246084813028574}}, {"key": "barone2017parallel", "year": "2017", "title": "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation", "topic_distr": {"0": 0.3611259460449219, "1": 0.0015913271345198154, "2": 0.0013450697297230363, "3": 0.0011649997904896736, "4": 0.10392921417951584, "5": 0.0009190123528242111, "6": 0.0008312544086948037, "7": 0.0007587957661598921, "8": 0.0006979564786888659, "9": 0.0006461490411311388, "10": 0.00060150126228109, "11": 0.0005626248894259334, "12": 0.0005284686922095716, "13": 0.0004982222453691065, "14": 0.21536418795585632, "15": 0.00044704944593831897, "16": 0.0004252124927006662, "17": 0.00040540951886214316, "18": 0.0003873689565807581, "19": 0.3077702522277832}}, {"key": "bavarian2022efficient", "year": "2022", "title": "Efficient Training of Language Models to Fill in the Middle", "topic_distr": {"0": 0.6441851854324341, "1": 0.0018856117967516184, "2": 0.001594177563674748, "3": 0.0013807315845042467, "4": 0.0012177545577287674, "5": 0.0010891907149925828, "6": 0.09406358003616333, "7": 0.025103669613599777, "8": 0.22314640879631042, "9": 0.0007657998939976096, "10": 0.0007128844154067338, "11": 0.0006668090936727822, "12": 0.000626328052021563, "13": 0.0005904807476326823, "14": 0.0005585147300735116, "15": 0.0005298319738358259, "16": 0.0005039513343945146, "17": 0.000480481336126104, "18": 0.00045910014887340367, "19": 0.00043954074499197304}}, {"key": "bavishi2017context2name", "year": "2017", "title": "Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts", "topic_distr": {"0": 0.001889220904558897, "1": 0.0015432402724400163, "2": 0.0013042798964306712, "3": 0.0011296779848635197, "4": 0.0009963233023881912, "5": 0.0008911379845812917, "6": 0.0008060416439548135, "7": 0.0007357807480730116, "8": 0.0006767868180759251, "9": 0.2513822913169861, "10": 0.000583257176913321, "11": 0.0005455599748529494, "12": 0.0005124397575855255, "13": 0.7345727682113647, "14": 0.0004569572629407048, "15": 0.00043349002953618765, "16": 0.0004123154212720692, "17": 0.0003931130631826818, "18": 0.00037561971112154424, "19": 0.00035961688263341784}}, {"key": "bavishi2019autopandas", "year": "2019", "title": "AutoPandas: neural-backed generators for program synthesis", "topic_distr": {"0": 0.03791432082653046, "1": 0.1845201700925827, "2": 0.0013042570790275931, "3": 0.0011296909069642425, "4": 0.0009963461197912693, "5": 0.000891157949808985, "6": 0.1399388313293457, "7": 0.25783267617225647, "8": 0.0006768020102754235, "9": 0.0006265647825784981, "10": 0.0005832702154293656, "11": 0.0005455721984617412, "12": 0.0005124512244947255, "13": 0.00048312157741747797, "14": 0.0004569675074890256, "15": 0.0004334997502155602, "16": 0.18614186346530914, "17": 0.0003931218816433102, "18": 0.18425963819026947, "19": 0.0003596249734982848}}, {"key": "beltramelli2017pix2code", "year": "2017", "title": "pix2code: Generating Code from a Graphical User Interface Screenshot", "topic_distr": {"0": 0.004300374537706375, "1": 0.003511881222948432, "2": 0.18482062220573425, "3": 0.0025709972251206636, "4": 0.2643677294254303, "5": 0.0020281164906919003, "6": 0.0018344480777159333, "7": 0.0016745432512834668, "8": 0.0015402805292978883, "9": 0.0014259496238082647, "10": 0.0013274189550429583, "11": 0.2373671978712082, "12": 0.0011662475299090147, "13": 0.001099498476833105, "14": 0.0010399764869362116, "15": 0.0009865680476650596, "16": 0.0009383772849105299, "17": 0.0008946752059273422, "18": 0.2862866222858429, "19": 0.0008184422040358186}}, {"key": "bennun2018neural", "year": "2018", "title": "Neural Code Comprehension: A Learnable Representation of Code Semantics", "topic_distr": {"0": 0.0015407754108309746, "1": 0.24417360126972198, "2": 0.0010627800365909934, "3": 0.21444831788539886, "4": 0.0008118377299979329, "5": 0.0007261279970407486, "6": 0.06189774349331856, "7": 0.0005995379178784788, "8": 0.0005514677031897008, "9": 0.000510533747728914, "10": 0.000475256732897833, "11": 0.0004445398517418653, "12": 0.4703828692436218, "13": 0.0003936542198061943, "14": 0.0003723435220308602, "15": 0.0003532216651365161, "16": 0.00033596789580769837, "17": 0.000320321210892871, "18": 0.000306067056953907, "19": 0.00029302743496373296}}, {"key": "berabi2021tfix", "year": "2021", "title": "TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer", "topic_distr": {"0": 0.0016870776889845729, "1": 0.00137619161978364, "2": 0.0011633374961093068, "3": 0.0010075713507831097, "4": 0.1615121215581894, "5": 0.2213243544101715, "6": 0.0007189236930571496, "7": 0.07527393847703934, "8": 0.0006036388804204762, "9": 0.15509748458862305, "10": 0.0005202179891057312, "11": 0.0004865951486863196, "12": 0.0004570546152535826, "13": 0.00043089553946629167, "14": 0.0004075687611475587, "15": 0.3311689794063568, "16": 0.045757606625556946, "17": 0.0003506249049678445, "18": 0.00033502228325232863, "19": 0.00032074906630441546}}, {"key": "berabi2024deepcode", "year": "2024", "title": "DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models", "topic_distr": {"0": 0.37506547570228577, "1": 0.0008856958011165261, "2": 0.0007485714158974588, "3": 0.000648355926387012, "4": 0.0005718239699490368, "5": 0.18506929278373718, "6": 0.0004626142035704106, "7": 0.0004222891293466091, "8": 0.063203364610672, "9": 0.00035959837259724736, "10": 0.03916070982813835, "11": 0.00031311504426412284, "12": 0.00029410625575110316, "13": 0.00027727335691452026, "14": 0.0002622630272526294, "15": 0.25909674167633057, "16": 0.00023664157197345048, "17": 0.00022562069352716208, "18": 0.07249004393815994, "19": 0.00020639612921513617}}, {"key": "bhatia2016automated", "year": "2016", "title": "Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks", "topic_distr": {"0": 0.0013270111521705985, "1": 0.0010833670385181904, "2": 0.0009157813037745655, "3": 0.0007931809523142874, "4": 0.000699553987942636, "5": 0.0006256997003220022, "6": 0.0005659506423398852, "7": 0.48301395773887634, "8": 0.00047519613872282207, "9": 0.00043992363498546183, "10": 0.00040952564449980855, "11": 0.0003830570785794407, "12": 0.17269474267959595, "13": 0.00033920927671715617, "14": 0.0003208459820598364, "15": 0.16784420609474182, "16": 0.0002895013603847474, "17": 0.0002760186907835305, "18": 0.16725072264671326, "19": 0.0002524998562876135}}, {"key": "bhatia2018neurosymbolic", "year": "2018", "title": "Neuro-symbolic program corrector for introductory programming assignments", "topic_distr": {"0": 0.0017335645388811827, "1": 0.0014144248561933637, "2": 0.001195647637359798, "3": 0.0010355577105656266, "4": 0.0009133225539699197, "5": 0.0008169001084752381, "6": 0.0007388927624560893, "7": 0.3748911917209625, "8": 0.0006204057135619223, "9": 0.0005743546644225717, "10": 0.000534667749889195, "11": 0.0005001109675504267, "12": 0.199063241481781, "13": 0.0004428642278071493, "14": 0.00041888951091095805, "15": 0.24765242636203766, "16": 0.00037796664400957525, "17": 0.0003603639779612422, "18": 0.16638554632663727, "19": 0.00032965827267616987}}, {"key": "bhoopchand2016learning", "year": "2016", "title": "Learning Python Code Suggestion with a Sparse Pointer Network", "topic_distr": {"0": 0.001520854071713984, "1": 0.15261921286582947, "2": 0.0010497980983927846, "3": 0.0009092726395465434, "4": 0.0008019428933039308, "5": 0.0007172792684286833, "6": 0.0006487850332632661, "7": 0.0005922318086959422, "8": 0.0005447473959065974, "9": 0.15768678486347198, "10": 0.6109978556632996, "11": 0.00043912255205214024, "12": 0.0004124640254303813, "13": 0.0003888570354320109, "14": 0.00036780606023967266, "15": 0.0003489172086119652, "16": 0.06904587149620056, "17": 0.00031641768873669207, "18": 0.0003023372555617243, "19": 0.000289456540485844}}, {"key": "bian2020sinkfinder", "year": "2020", "title": "SinkFinder: harvesting hundreds of unknown interesting function pairs with just one seed", "topic_distr": {"0": 0.0017579166451469064, "1": 0.001434262958355248, "2": 0.001212513423524797, "3": 0.0010501439683139324, "4": 0.0009261852246709168, "5": 0.24680732190608978, "6": 0.0007492969161830842, "7": 0.4665735363960266, "8": 0.0006291415193118155, "9": 0.0005824420368298888, "10": 0.0005421962705440819, "11": 0.000507152930367738, "12": 0.0004763643373735249, "13": 0.00044910007272846997, "14": 0.00042478780960664153, "15": 0.0004029726260341704, "16": 0.00038328871596604586, "17": 0.0003654381725937128, "18": 0.27439165115356445, "19": 0.0003343001299072057}}, {"key": "bibaev2022all", "year": "2022", "title": "All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs", "topic_distr": {"0": 0.0020806158427149057, "1": 0.0016972364392131567, "2": 0.05887734889984131, "3": 0.0012426701141521335, "4": 0.0010959826176986098, "5": 0.0009802753338590264, "6": 0.0008866672869771719, "7": 0.000809378398116678, "8": 0.000744483491871506, "9": 0.000689222477376461, "10": 0.3952975869178772, "11": 0.0006001304718665779, "12": 0.0005636973073706031, "13": 0.0005314346635714173, "14": 0.0005026651197113097, "15": 0.0004768505459651351, "16": 0.00045355790643952787, "17": 0.00043243481195531785, "18": 0.33580610156059265, "19": 0.1962316781282425}}, {"key": "bichsel2016statistical", "year": "2016", "title": "Statistical Deobfuscation of Android Applications", "topic_distr": {"0": 0.0016202267725020647, "1": 0.0013226446462795138, "2": 0.00111799081787467, "3": 0.0009683191310614347, "4": 0.0008540122071281075, "5": 0.0007638510433025658, "6": 0.000690909568220377, "7": 0.0006306844879873097, "8": 0.0005801169900223613, "9": 0.43727371096611023, "10": 0.0004999467637389898, "11": 0.00046763409045524895, "12": 0.0004392446717247367, "13": 0.43547406792640686, "14": 0.00039168712100945413, "15": 0.0003715718339662999, "16": 0.00035342175397090614, "17": 0.0003369621990714222, "18": 0.1155347228050232, "19": 0.0003082504845224321}}, {"key": "bieber2020learning", "year": "2020", "title": "Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks", "topic_distr": {"0": 0.0013560893712565303, "1": 0.0011068558087572455, "2": 0.000935753108933568, "3": 0.0008104312582872808, "4": 0.0007147710421122611, "5": 0.000639309932012111, "6": 0.0005782610387541354, "7": 0.0005278552998788655, "8": 0.0004855325387325138, "9": 0.0004494927707128227, "10": 0.01823742873966694, "11": 0.0003913892724085599, "12": 0.8172847032546997, "13": 0.0003465876798145473, "14": 0.0003278249641880393, "15": 0.15470239520072937, "16": 0.0002957985270768404, "17": 0.00028202260727994144, "18": 0.0002694727445486933, "19": 0.000257992185652256}}, {"key": "bieber2022static", "year": "2022", "title": "Static Prediction of Runtime Errors by Learning to Execute Programs with External Resource Descriptions", "topic_distr": {"0": 0.17360959947109222, "1": 0.0014759645564481616, "2": 0.0012475684052333236, "3": 0.001080573070794344, "4": 0.0009530278039164841, "5": 0.14698874950408936, "6": 0.0007710134377703071, "7": 0.15743468701839447, "8": 0.0006473755929619074, "9": 0.12527726590633392, "10": 0.0005579104763455689, "11": 0.0005218514706939459, "12": 0.0004901705542579293, "13": 0.00046211612061597407, "14": 0.0004370991955511272, "15": 0.17444796860218048, "16": 0.00039439735701307654, "17": 0.00037602949305437505, "18": 0.21248263120651245, "19": 0.0003439889696892351}}, {"key": "bielik2016phog", "year": "2016", "title": "PHOG: Probabilistic Model for Code", "topic_distr": {"0": 0.003197598969563842, "1": 0.002611184027045965, "2": 0.0022072596475481987, "3": 0.0019117803312838078, "4": 0.2039952427148819, "5": 0.0015081085730344057, "6": 0.20941899716854095, "7": 0.0012451912043616176, "8": 0.2039267122745514, "9": 0.0010603368282318115, "10": 0.0009870693320408463, "11": 0.0009232728625647724, "12": 0.0008672222029417753, "13": 0.3620257079601288, "14": 0.0007733270176686347, "15": 0.0007336124544963241, "16": 0.0006977778393775225, "17": 0.0006652809679508209, "18": 0.0006356762605719268, "19": 0.0006085940403863788}}, {"key": "bielik2020adversarial", "year": "2020", "title": "Adversarial Robustness for Code", "topic_distr": {"0": 0.0025995858013629913, "1": 0.19376350939273834, "2": 0.2198163866996765, "3": 0.1424943059682846, "4": 0.0013699731789529324, "5": 0.21448664367198944, "6": 0.0011083281133323908, "7": 0.0010117175988852978, "8": 0.0009305993444286287, "9": 0.2161550521850586, "10": 0.0008019936503842473, "11": 0.0007501590298488736, "12": 0.0007046178798191249, "13": 0.0006642897496931255, "14": 0.0006283280672505498, "15": 0.0005960600101388991, "16": 0.000566944363527, "17": 0.0005405406118370593, "18": 0.0005164868198335171, "19": 0.0004944825195707381}}, {"key": "bouzenia2023tracefixer", "year": "2023", "title": "TraceFixer: Execution Trace-Driven Program Repair", "topic_distr": {"0": 0.001540163648314774, "1": 0.0012573222629725933, "2": 0.0010627430165186524, "3": 0.0009204884408973157, "4": 0.0008118345867842436, "5": 0.0007261265418492258, "6": 0.0006567873642779887, "7": 0.0005995366955175996, "8": 0.0005514665972441435, "9": 0.0005105326999910176, "10": 0.0004752557724714279, "11": 0.00044453892041929066, "12": 0.0004175515496172011, "13": 0.0003936534048989415, "14": 0.0003723427653312683, "15": 0.9880043268203735, "16": 0.0003359671973157674, "17": 0.0003203205415047705, "18": 0.00030606641666963696, "19": 0.0002930268528871238}}, {"key": "bouzenia2024repairagent", "year": "2024", "title": "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair", "topic_distr": {"0": 0.1803947240114212, "1": 0.0013577014906331897, "2": 0.0011477635707706213, "3": 0.0009941215394064784, "4": 0.0008767704130150378, "5": 0.0007842066115699708, "6": 0.0007093212916515768, "7": 0.0006474912515841424, "8": 0.0005955762462690473, "9": 0.0005513682262971997, "10": 0.0005132696242071688, "11": 0.00048009585589170456, "12": 0.0004509498830884695, "13": 0.00042514019878581166, "14": 0.000402124976972118, "15": 0.7171220779418945, "16": 0.00036283989902585745, "17": 0.0003459417202975601, "18": 0.09152208268642426, "19": 0.0003164648951496929}}, {"key": "brach2024can", "year": "2024", "title": "Can Large Language Model Detect Plagiarism in Source Code?", "topic_distr": {"0": 0.5045185685157776, "1": 0.0014759631594642997, "2": 0.0012476242845878005, "3": 0.07668711245059967, "4": 0.0009530444513075054, "5": 0.31350231170654297, "6": 0.0007710268255323172, "7": 0.000703818048350513, "8": 0.0006473868270404637, "9": 0.0005993330269120634, "10": 0.0005579201388172805, "11": 0.0005218604928813875, "12": 0.0004901790525764227, "13": 0.0949983224272728, "14": 0.0004371067916508764, "15": 0.00041465897811576724, "16": 0.0003944041964132339, "17": 0.00037603601231239736, "18": 0.0003593025903683156, "19": 0.0003439949359744787}}, {"key": "brauckmann2020compiler", "year": "2020", "title": "Compiler-based graph representations for deep learning models of code", "topic_distr": {"0": 0.05702033266425133, "1": 0.09617647528648376, "2": 0.0008045457652769983, "3": 0.0006968242232687771, "4": 0.12074983865022659, "5": 0.0005496886442415416, "6": 0.0004971980233676732, "7": 0.0004538583161775023, "8": 0.0004174685454927385, "9": 0.018937036395072937, "10": 0.0003597757895477116, "11": 0.0003365227021276951, "12": 0.37557974457740784, "13": 0.0002980015706270933, "14": 0.00028186908457428217, "15": 0.00026739356690086424, "16": 0.0002543322625569999, "17": 0.32586556673049927, "18": 0.0002316969184903428, "19": 0.00022182574321050197}}, {"key": "brauckmann2020compy", "year": "2020", "title": "ComPy-Learn: A toolbox for exploring machine learning representations for compilers", "topic_distr": {"0": 0.00192009296733886, "1": 0.0015668238047510386, "2": 0.05355857312679291, "3": 0.19543032348155975, "4": 0.0010116650955751538, "5": 0.22385554015636444, "6": 0.0008184523321688175, "7": 0.000747109588701278, "8": 0.0006872072699479759, "9": 0.0006361977430060506, "10": 0.0005922375712543726, "11": 0.0005539599223993719, "12": 0.1620096117258072, "13": 0.0004905491950921714, "14": 0.0004639930266421288, "15": 0.0004401644691824913, "16": 0.00041866383980959654, "17": 0.35405221581459045, "18": 0.000381403136998415, "19": 0.0003651539154816419}}, {"key": "briem2020offside", "year": "2020", "title": "OffSide: Learning to Identify Mistakes in Boundary Conditions", "topic_distr": {"0": 0.0019188614096492529, "1": 0.0015668160049244761, "2": 0.0013243762077763677, "3": 0.00114706892054528, "4": 0.001011664979159832, "5": 0.9855165481567383, "6": 0.0008184527978301048, "7": 0.0007471099961549044, "8": 0.0006872076774016023, "9": 0.000636198150459677, "10": 0.0005922379205003381, "11": 0.0005539602716453373, "12": 0.0005203300970606506, "13": 0.000490549486130476, "14": 0.0004639933176804334, "15": 0.0004401647311169654, "16": 0.0004186640726402402, "17": 0.0003991660487372428, "18": 0.0003814033407252282, "19": 0.0003651541192084551}}, {"key": "brockschmidt2019generative", "year": "2019", "title": "Generative Code Modeling with Graphs", "topic_distr": {"0": 0.003040601732209325, "1": 0.002484084339812398, "2": 0.002099562669172883, "3": 0.0018185258377343416, "4": 0.0016038704197853804, "5": 0.0014345435192808509, "6": 0.30774155259132385, "7": 0.0011844511609524488, "8": 0.2755334675312042, "9": 0.0010086139664053917, "10": 0.0009389204205945134, "11": 0.21158811450004578, "12": 0.18483245372772217, "13": 0.0007777059217914939, "14": 0.0007356043788604438, "15": 0.0006978270830586553, "16": 0.0006637404439970851, "17": 0.00063282874180004, "18": 0.0006046681082807481, "19": 0.0005789069691672921}}, {"key": "brody2020structural", "year": "2020", "title": "A Structural Model for Contextual Code Changes", "topic_distr": {"0": 0.10546102374792099, "1": 0.0014343546936288476, "2": 0.0012124436907470226, "3": 0.0010501501383259892, "4": 0.0009261924424208701, "5": 0.0008284098003059626, "6": 0.0007493036100640893, "7": 0.000683988444507122, "8": 0.4100707769393921, "9": 0.0005824472173117101, "10": 0.13027168810367584, "11": 0.0005071574123576283, "12": 0.34351301193237305, "13": 0.000449104089057073, "14": 0.0004247915931046009, "15": 0.000402976234909147, "16": 0.0003832921211142093, "17": 0.00036544143222272396, "18": 0.0003491794632282108, "19": 0.0003343030984979123}}, {"key": "bruch2009learning", "year": "2009", "title": "Learning from Examples to Improve Code Completion Systems", "topic_distr": {"0": 0.00215074117295444, "1": 0.0017558963736519217, "2": 0.0014841948868706822, "3": 0.0012855074601247907, "4": 0.0011337592732161283, "5": 0.0010140632512047887, "6": 0.000917228520847857, "7": 0.0008372757001779974, "8": 0.0007701439899392426, "9": 0.0634360983967781, "10": 0.6006993055343628, "11": 0.0006208154372870922, "12": 0.0005831265589222312, "13": 0.0005497519159689546, "14": 0.0005199907463975251, "15": 0.0004932864103466272, "16": 0.000469190941657871, "17": 0.00044733978575095534, "18": 0.32042309641838074, "19": 0.00040922308107838035}}, {"key": "buech2019learning", "year": "2019", "title": "Learning-based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection", "topic_distr": {"0": 0.0020142686553299427, "1": 0.15706206858158112, "2": 0.14659632742404938, "3": 0.4142262041568756, "4": 0.001060637878254056, "5": 0.0009486626950092614, "6": 0.0610724613070488, "7": 0.07238650321960449, "8": 0.000720474636182189, "9": 0.0006669957656413317, "10": 0.0006209074635989964, "11": 0.0005807768320664763, "12": 0.13894136250019073, "13": 0.0005142964073456824, "14": 0.0004864546936005354, "15": 0.0004614725767169148, "16": 0.00043893110705539584, "17": 0.00041848921682685614, "18": 0.0003998666361439973, "19": 0.00038283082540147007}}, {"key": "bui2018bilateral", "year": "2018", "title": "Bilateral Dependency Neural Networks for Cross-Language Algorithm Classification", "topic_distr": {"0": 0.0009748736047185957, "1": 0.7128090262413025, "2": 0.0006725238054059446, "3": 0.08729008585214615, "4": 0.0005137405241839588, "5": 0.0004595029167830944, "6": 0.00041562423575669527, "7": 0.0003793951473198831, "8": 0.0003489757073111832, "9": 0.00032307219225913286, "10": 0.00030074844835326076, "11": 0.0002813104074448347, "12": 0.1937284767627716, "13": 0.0002491093473508954, "14": 0.00023562366550322622, "15": 0.00022352310770656914, "16": 0.00021260471839923412, "17": 0.00020270328968763351, "18": 0.00019368309585843235, "19": 0.00018543146143201739}}, {"key": "bui2018cross", "year": "2018", "title": "Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks", "topic_distr": {"0": 0.0020441687665879726, "1": 0.7215282917022705, "2": 0.0014111833879724145, "3": 0.0012222970835864544, "4": 0.0010780130978673697, "5": 0.0009642021032050252, "6": 0.0008721288177184761, "7": 0.0007961072260513902, "8": 0.0007322763558477163, "9": 0.0006779214600101113, "10": 0.0006310782628133893, "11": 0.0005902902339585125, "12": 0.0005544545128941536, "13": 0.0005227208603173494, "14": 0.0004944230895489454, "15": 0.0004690317437052727, "16": 0.00044612103374674916, "17": 0.00042534430394880474, "18": 0.26415085792541504, "19": 0.00038910176954232156}}, {"key": "bui2018hierarchical", "year": "2018", "title": "Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code", "topic_distr": {"0": 0.0011341709177941084, "1": 0.000925793603528291, "2": 0.000782571907620877, "3": 0.2765195369720459, "4": 0.0005978067056275904, "5": 0.0005346938269212842, "6": 0.21811096370220184, "7": 0.17653991281986237, "8": 0.1749810427427292, "9": 0.00037593822344206274, "10": 0.0003499615122564137, "11": 0.0003273427137173712, "12": 0.0003074701817240566, "13": 0.00028987243422307074, "14": 0.03561429679393768, "15": 0.0002600993902888149, "16": 0.00024739434593357146, "17": 0.0002358727069804445, "18": 0.11164950579404831, "19": 0.00021577459119725972}}, {"key": "bui2019learning", "year": "2019", "title": "SAR: Learning Cross-Language API Mappings with Little Knowledge", "topic_distr": {"0": 0.0013725500321015716, "1": 0.0011193891987204552, "2": 0.0009459705324843526, "3": 0.0008193421526812017, "4": 0.000722624477930367, "5": 0.000646333210170269, "6": 0.0005846137646585703, "7": 0.7632049322128296, "8": 0.0004908665432594717, "9": 0.0004544308176264167, "10": 0.0004230304330121726, "11": 0.00039568901411257684, "12": 0.0003716672654263675, "13": 0.0003503952466417104, "14": 0.0878898873925209, "15": 0.0003144058573525399, "16": 0.00029904814437031746, "17": 0.000285120855551213, "18": 0.13904884457588196, "19": 0.00026082643307745457}}, {"key": "bui2021efficient", "year": "2021", "title": "Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations", "topic_distr": {"0": 0.002447474515065551, "1": 0.0019971502479165792, "2": 0.0016879391623660922, "3": 0.10269330441951752, "4": 0.24554035067558289, "5": 0.0011532916687428951, "6": 0.0010431618429720402, "7": 0.0009522316395305097, "8": 0.0008758829208090901, "9": 0.0008108685142360628, "10": 0.0007548388675786555, "11": 0.00070605194196105, "12": 0.06824291497468948, "13": 0.0006252315361052752, "14": 0.0005913842469453812, "15": 0.0005610134685412049, "16": 0.0005336097092367709, "17": 0.0005087584722787142, "18": 0.0004861189518123865, "19": 0.5677884817123413}}, {"key": "bui2021infercode", "year": "2021", "title": "InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees", "topic_distr": {"0": 0.07651884853839874, "1": 0.0010499992640689015, "2": 0.06277231872081757, "3": 0.23398500680923462, "4": 0.3688586354255676, "5": 0.024673134088516235, "6": 0.0005484546418301761, "7": 0.0005006470018997788, "8": 0.00046050577657297254, "9": 0.0004263236769475043, "10": 0.04195277392864227, "11": 0.00037121513742022216, "12": 0.07500398904085159, "13": 0.02021893486380577, "14": 0.00031092725112102926, "15": 0.0002949594345409423, "16": 0.00028055161237716675, "17": 0.09127245843410492, "18": 0.0002555827668402344, "19": 0.0002446939761284739}}, {"key": "cai2020tag", "year": "2020", "title": "TAG : Type Auxiliary Guiding for Code Comment Generation", "topic_distr": {"0": 0.0024948997888714075, "1": 0.002036623889580369, "2": 0.21627335250377655, "3": 0.001491187373176217, "4": 0.0013151658931747079, "5": 0.0011763176880776882, "6": 0.0010639893589541316, "7": 0.0009712436585687101, "8": 0.0008933705976232886, "9": 0.2804451286792755, "10": 0.0007699097623117268, "11": 0.0007201487896963954, "12": 0.0006764295394532382, "13": 0.0006377147510647774, "14": 0.0006031917291693389, "15": 0.0005722145433537662, "16": 0.0005442636320367455, "17": 0.0005189162329770625, "18": 0.0004958246718160808, "19": 0.4863000810146332}}, {"key": "cambronero2019deep", "year": "2019", "title": "When Deep Learning Met Code Search", "topic_distr": {"0": 0.058127742260694504, "1": 0.5060692429542542, "2": 0.1713864654302597, "3": 0.12004678696393967, "4": 0.0007736393017694354, "5": 0.0006919627194292843, "6": 0.0006258859648369253, "7": 0.13772894442081451, "8": 0.0005255204159766436, "9": 0.000486512464703992, "10": 0.00045289527042768896, "11": 0.0004236236563883722, "12": 0.000397906027501449, "13": 0.0003751322510652244, "14": 0.00035482426756061614, "15": 0.00033660209737718105, "16": 0.00032016015029512346, "17": 0.00030524967587552965, "18": 0.0002916661906056106, "19": 0.00027924010646529496}}, {"key": "campbell2014syntax", "year": "2014", "title": "Syntax Errors Just Aren\u2019t Natural: Improving Error Reporting with Language Models", "topic_distr": {"0": 0.23111911118030548, "1": 0.0019214119529351592, "2": 0.0016242319252341986, "3": 0.09614813327789307, "4": 0.0012407447211444378, "5": 0.0011097548995167017, "6": 0.0010037823813036084, "7": 0.527100145816803, "8": 0.0008428182918578386, "9": 0.0007802582113072276, "10": 0.000726343656424433, "11": 0.061529215425252914, "12": 0.0006381530547514558, "13": 0.0006016289698891342, "14": 0.0005690594553016126, "15": 0.026356708258390427, "16": 0.0005134659586474299, "17": 0.04525941237807274, "18": 0.00046776793897151947, "19": 0.0004478392656892538}}, {"key": "casey2024survey", "year": "2024", "title": "A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks", "topic_distr": {"0": 0.26103857159614563, "1": 0.0015430613420903683, "2": 0.0013042871141806245, "3": 0.5737420320510864, "4": 0.0009963519405573606, "5": 0.12823021411895752, "6": 0.0008060635882429779, "7": 0.0007358007715083659, "8": 0.0006768052116967738, "9": 0.0006265677511692047, "10": 0.0005832730093970895, "11": 0.0005455747595988214, "12": 0.026257097721099854, "13": 0.0004831238475162536, "14": 0.00045696969027630985, "15": 0.00043350178748369217, "16": 0.00041232662624679506, "17": 0.0003931237442884594, "18": 0.00037562992656603456, "19": 0.0003596266615204513}}, {"key": "cassano2023can", "year": "2023", "title": "Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions", "topic_distr": {"0": 0.7407647371292114, "1": 0.0016163440886884928, "2": 0.0013664192520081997, "3": 0.0011834740871563554, "4": 0.15552282333374023, "5": 0.0009335813228972256, "6": 0.0008444320410490036, "7": 0.0007708247867412865, "8": 0.0007090210565365851, "9": 0.000656392308883369, "10": 0.0006110367248766124, "11": 0.000571544049307704, "12": 0.0005368463462218642, "13": 0.0005061205010861158, "14": 0.0004787213692907244, "15": 0.09131360054016113, "16": 0.0004319532890804112, "17": 0.0004118363722227514, "18": 0.0003935098648071289, "19": 0.0003767448361031711}}, {"key": "cerulo2013hidden", "year": "2013", "title": "A Hidden Markov Model to Detect Coded Information Islands in Free Text", "topic_distr": {"0": 0.0019804516341537237, "1": 0.001616432797163725, "2": 0.11701317876577377, "3": 0.0011835031909868121, "4": 0.001043802942149341, "5": 0.14001542329788208, "6": 0.4368010461330414, "7": 0.0007708441698923707, "8": 0.2218833863735199, "9": 0.0006564088398590684, "10": 0.0006110520916990936, "11": 0.0005715584265999496, "12": 0.0005368599086068571, "13": 0.0005061332485638559, "14": 0.00047873344738036394, "15": 0.0004541478701867163, "16": 0.07269491255283356, "17": 0.00041184676229022443, "18": 0.0003935197601094842, "19": 0.0003767543239519}}, {"key": "cerulo2015irish", "year": "2015", "title": "Irish: A Hidden Markov Model to detect coded information islands in free text", "topic_distr": {"0": 0.44794800877571106, "1": 0.0015430465573444963, "2": 0.0013043393846601248, "3": 0.0011297061573714018, "4": 0.000996349030174315, "5": 0.000891159288585186, "6": 0.3008580207824707, "7": 0.0007357982103712857, "8": 0.0006768028251826763, "9": 0.0006265655974857509, "10": 0.0005832709721289575, "11": 0.0005455728969536722, "12": 0.0005124518647789955, "13": 0.00048312218859791756, "14": 0.00045696808956563473, "15": 0.0004335003031883389, "16": 0.2391469031572342, "17": 0.00039312237640842795, "18": 0.000375628616893664, "19": 0.00035962541005574167}}, {"key": "chae2016automatically", "year": "2016", "title": "Automatically generating features for learning program analysis heuristics", "topic_distr": {"0": 0.0017078717937693, "1": 0.0013951166765764356, "2": 0.0011792012955993414, "3": 0.0010213759960606694, "4": 0.0009007895132526755, "5": 0.0008056894293986261, "6": 0.0007287526968866587, "7": 0.0006652289885096252, "8": 0.0006118917372077703, "9": 0.0005664727068506181, "10": 0.0005273303831927478, "11": 0.0004932478186674416, "12": 0.0004633034113794565, "13": 0.00043678670772351325, "14": 0.0004131410096306354, "15": 0.00039192396798171103, "16": 0.0003727797302417457, "17": 0.00035541862598620355, "18": 0.9866384863853455, "19": 0.00032513431506231427}}, {"key": "chakraborty2018tree2tree", "year": "2018", "title": "CODIT: Code Editing with Tree-Based Neural Machine Translation", "topic_distr": {"0": 0.0018894376698881388, "1": 0.0015430992934852839, "2": 0.0013043021317571402, "3": 0.0011296854354441166, "4": 0.000996334943920374, "5": 0.0008911472395993769, "6": 0.0008060500840656459, "7": 0.0007357884314842522, "8": 0.20331676304340363, "9": 0.0006265572737902403, "10": 0.000583263230510056, "11": 0.634294331073761, "12": 0.0005124450544826686, "13": 0.0004831157566513866, "14": 0.0004569620359688997, "15": 0.14889007806777954, "16": 0.00041231969953514636, "17": 0.0003931171668227762, "18": 0.00037562361103482544, "19": 0.00035962063702754676}}, {"key": "chakraborty2020deep", "year": "2021", "title": "Deep Learning based Vulnerability Detection: Are We There Yet?", "topic_distr": {"0": 0.3338325321674347, "1": 0.0009983887430280447, "2": 0.0008439570665359497, "3": 0.1015801802277565, "4": 0.0006446960614994168, "5": 0.472393274307251, "6": 0.0005215689307078719, "7": 0.0004761048767250031, "8": 0.0004379314195830375, "9": 0.00040542494389228523, "10": 0.00037741076084785163, "11": 0.00035301788011565804, "12": 0.00033158663427457213, "13": 0.01451636478304863, "14": 0.00029568534228019416, "15": 0.00028050027322024107, "16": 0.0002667987428139895, "17": 0.0002543733862694353, "18": 0.0002430539025226608, "19": 0.07094715535640717}}, {"key": "chakraborty2021multimodal", "year": "2021", "title": "On Multi-Modal Learning of Editing Source Code", "topic_distr": {"0": 0.0022262590937316418, "1": 0.0018191725248470902, "2": 0.001537187141366303, "3": 0.0013314057141542435, "4": 0.001174245378933847, "5": 0.0010502750519663095, "6": 0.0009499825537204742, "7": 0.000867174647282809, "8": 0.0007976456545293331, "9": 0.0007384385680779815, "10": 0.0006874137325212359, "11": 0.9827821850776672, "12": 0.0006039498839527369, "13": 0.000569383439142257, "14": 0.0005385595140978694, "15": 0.0005109015619382262, "16": 0.00048594563850201666, "17": 0.0004633141797967255, "18": 0.000442696938989684, "19": 0.00042383637628518045}}, {"key": "chen2019capturing", "year": "2019", "title": "Capturing source code semantics via tree-based convolution over API-enhanced AST", "topic_distr": {"0": 0.0031975305173546076, "1": 0.13495966792106628, "2": 0.05149172991514206, "3": 0.1951647698879242, "4": 0.001686141244135797, "5": 0.0015081284800544381, "6": 0.0013641148107126355, "7": 0.0012452078517526388, "8": 0.14757972955703735, "9": 0.057250507175922394, "10": 0.0009870826033875346, "11": 0.0009232852607965469, "12": 0.39771023392677307, "13": 0.0008175985421985388, "14": 0.0007733373786322773, "15": 0.0007336222915910184, "16": 0.0006977872108109295, "17": 0.0006652898737229407, "18": 0.0006356847588904202, "19": 0.0006086021894589067}}, {"key": "chen2019literature", "year": "2019", "title": "A Literature Study of Embeddings on Source Code", "topic_distr": {"0": 0.002228784840553999, "1": 0.0018187372479587793, "2": 0.46022287011146545, "3": 0.2012816220521927, "4": 0.001174270175397396, "5": 0.0010502975201234221, "6": 0.0009500027517788112, "7": 0.3243681788444519, "8": 0.0007976625929586589, "9": 0.0007384542259387672, "10": 0.0006874282844364643, "11": 0.0006429982604458928, "12": 0.0006039626896381378, "13": 0.0005693954881280661, "14": 0.0005385709227994084, "15": 0.0005109123885631561, "16": 0.00048595594125799835, "17": 0.0004633240168914199, "18": 0.00044270631042309105, "19": 0.0004238453402649611}}, {"key": "chen2019mining", "year": "2019", "title": "Mining Likely Analogical APIs across Third-Party Libraries via Large-Scale Unsupervised API Semantics Embedding", "topic_distr": {"0": 0.001246970728971064, "1": 0.0010182581609115005, "2": 0.0008608397911302745, "3": 0.0007456054445356131, "4": 0.08236076682806015, "5": 0.0005881669349037111, "6": 0.19075073301792145, "7": 0.6488438844680786, "8": 0.0004466914397198707, "9": 0.0004135347262490541, "10": 0.00038496017805300653, "11": 0.00036007934249937534, "12": 0.0003382194263394922, "13": 0.07003669440746307, "14": 0.0003015999973285943, "15": 0.00028611120069399476, "16": 0.0002721355704125017, "17": 0.00025946166715584695, "18": 0.00024791574105620384, "19": 0.00023735359718557447}}, {"key": "chen2019sequencer", "year": "2019", "title": "SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair", "topic_distr": {"0": 0.002713107503950596, "1": 0.0022138229105621576, "2": 0.0018714673351496458, "3": 0.001620863564312458, "4": 0.001429545460268855, "5": 0.0012786212610080838, "6": 0.0011565233580768108, "7": 0.0010557117639109492, "8": 0.000971066125202924, "9": 0.00089898647274822, "10": 0.0008368680137209594, "11": 0.0007827794179320335, "12": 0.0007352579268626869, "13": 0.0006931761163286865, "14": 0.0006556506268680096, "15": 0.6808002591133118, "16": 0.0005915976944379508, "17": 0.000564045796636492, "18": 0.2986147105693817, "19": 0.0005159848951734602}}, {"key": "chen2021evaluating", "year": "2021", "title": "Evaluating Large Language Models Trained on Code", "topic_distr": {"0": 0.655683696269989, "1": 0.0024838654790073633, "2": 0.06028318405151367, "3": 0.0018185318913310766, "4": 0.1009015142917633, "5": 0.0014345557428896427, "6": 0.0012975673889741302, "7": 0.0011844612890854478, "8": 0.0010894926963374019, "9": 0.0010086225811392069, "10": 0.0009389284532517195, "11": 0.000878243416082114, "12": 0.000824926421046257, "13": 0.06055670604109764, "14": 0.0007356106652878225, "15": 0.0006978330202400684, "16": 0.0006637460901401937, "17": 0.000632834155112505, "18": 0.10630673915147781, "19": 0.0005789119168184698}}, {"key": "chen2021plur", "year": "2021", "title": "PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair", "topic_distr": {"0": 0.2524365186691284, "1": 0.0019214401254430413, "2": 0.0016241990961134434, "3": 0.001406790455803275, "4": 0.3293488323688507, "5": 0.0011097401147708297, "6": 0.001003769226372242, "7": 0.0009162728092633188, "8": 0.0008428072324022651, "9": 0.0007802479085512459, "10": 0.0007263341103680432, "11": 0.0006793895154260099, "12": 0.09675461053848267, "13": 0.0006016210536472499, "14": 0.0005690520047210157, "15": 0.1331414133310318, "16": 0.0005134591483511031, "17": 0.17470788955688477, "18": 0.00046776176895946264, "19": 0.0004478333576116711}}, {"key": "chen2022codet", "year": "2022", "title": "CodeT: Code Generation with Generated Tests", "topic_distr": {"0": 0.20871075987815857, "1": 0.0014758312609046698, "2": 0.39548856019973755, "3": 0.0010805658530443907, "4": 0.21041575074195862, "5": 0.0008524038712494075, "6": 0.0007710064528509974, "7": 0.0007037995383143425, "8": 0.000647369772195816, "9": 0.0005993172526359558, "10": 0.0005579054704867303, "11": 0.09863676130771637, "12": 0.0004901661886833608, "13": 0.00046211195876821876, "14": 0.000437095295637846, "15": 0.0771968737244606, "16": 0.0003943938063457608, "17": 0.00037602611701004207, "18": 0.00035929313162341714, "19": 0.0003439858846832067}}, {"key": "chen2022learning.md", "year": "2022", "title": "Learning to Reverse DNNs from AI Programs Automatically", "topic_distr": {"0": 0.004621532745659351, "1": 0.0037712145131081343, "2": 0.0031882754992693663, "3": 0.20091548562049866, "4": 0.002435480710119009, "5": 0.07730001211166382, "6": 0.0019703442230820656, "7": 0.6914759874343872, "8": 0.0016543847741559148, "9": 0.0015315841883420944, "10": 0.001425754395313561, "11": 0.0013336046831682324, "12": 0.0012526432983577251, "13": 0.0011809495044872165, "14": 0.001117018167860806, "15": 0.001059653121046722, "16": 0.0010078924242407084, "17": 0.0009609528933651745, "18": 0.0009181909845210612, "19": 0.0008790725260041654}}, {"key": "chen2023diversevul", "year": "2023", "title": "DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection", "topic_distr": {"0": 0.6337292194366455, "1": 0.0011442602844908834, "2": 0.0009672431042417884, "3": 0.07904060930013657, "4": 0.0007388587109744549, "5": 0.2398734837770462, "6": 0.0005977483233436942, "7": 0.0005456439102999866, "8": 0.0005018948577344418, "9": 0.00046464058686979115, "10": 0.00043253469630144536, "11": 0.0004045790119562298, "12": 0.03939812630414963, "13": 0.00035826762905344367, "14": 0.0003388726036064327, "15": 0.00032146964804269373, "16": 0.00030576688004657626, "17": 0.0002915266959462315, "18": 0.0002785538963507861, "19": 0.00026668646023608744}}, {"key": "chen2023supersonic", "year": "2023", "title": "Supersonic: Learning to Generate Source Code Optimizations in C/C++", "topic_distr": {"0": 0.0037814255338162184, "1": 0.003085981123149395, "2": 0.002608593786135316, "3": 0.0022593492176383734, "4": 0.0019926598761230707, "5": 0.0017822838854044676, "6": 0.06737566739320755, "7": 0.0014715680154040456, "8": 0.0013535795733332634, "9": 0.0012531069805845618, "10": 0.0011665194761008024, "11": 0.10252736508846283, "12": 0.0010248839389532804, "13": 0.0009662256925366819, "14": 0.0009139185422100127, "15": 0.0008669838425703347, "16": 0.0008246344514191151, "17": 0.4936380982398987, "18": 0.3103879392147064, "19": 0.0007192369666881859}}, {"key": "chen2024ppm.md", "year": "2024", "title": "PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models", "topic_distr": {"0": 0.0015219004126265645, "1": 0.0012418969999998808, "2": 0.13100676238536835, "3": 0.0009092558175325394, "4": 0.29717743396759033, "5": 0.0007172676268965006, "6": 0.0006487744394689798, "7": 0.0005922222044318914, "8": 0.0005447385483421385, "9": 0.06851287931203842, "10": 0.0004694575327448547, "11": 0.0004391154507175088, "12": 0.00041245733154937625, "13": 0.00038885074900463223, "14": 0.00036780009395442903, "15": 0.0003489115624688566, "16": 0.00033186833024956286, "17": 0.0003164125664625317, "18": 0.0003023323661182076, "19": 0.49374961853027344}}, {"key": "chibotaru2019scalable", "year": "2019", "title": "Scalable Taint Specification Inference with Big Code", "topic_distr": {"0": 0.0017826446564868093, "1": 0.001454740995541215, "2": 0.0012297563953325152, "3": 0.0010651415213942528, "4": 0.0009394132648594677, "5": 0.086277537047863, "6": 0.0007599989767186344, "7": 0.0006937514990568161, "8": 0.0006381273851729929, "9": 0.26072022318840027, "10": 0.0005499403341673315, "11": 0.0005143964663147926, "12": 0.00048316814354620874, "13": 0.5373832583427429, "14": 0.0004308549396228045, "15": 0.00040872819954529405, "16": 0.000388763117371127, "17": 0.02291630581021309, "18": 0.0810241848230362, "19": 0.0003390748461242765}}, {"key": "chirkova2020empirical", "year": "2020", "title": "Empirical Study of Transformers for Source Code", "topic_distr": {"0": 0.1971534788608551, "1": 0.0016165097476914525, "2": 0.001366399577818811, "3": 0.2902964949607849, "4": 0.0010437984019517899, "5": 0.0009336000075563788, "6": 0.0008444486302323639, "7": 0.0007708398625254631, "8": 0.0007090349099598825, "9": 0.0006564051727764308, "10": 0.0006110486574470997, "11": 0.0005715552251785994, "12": 0.0005368568818084896, "13": 0.0005061303963884711, "14": 0.00047873074072413146, "15": 0.0004541453090496361, "16": 0.35593828558921814, "17": 0.0004118444339837879, "18": 0.0003935175482183695, "19": 0.1447068452835083}}, {"key": "chirkova2021embeddings", "year": "2021", "title": "On the Embeddings of Variables in Recurrent Neural Networks for Source Code", "topic_distr": {"0": 0.0025988086126744747, "1": 0.002121812431141734, "2": 0.001793383969925344, "3": 0.001553306938149035, "4": 0.0013699601404368877, "5": 0.0012253287713974714, "6": 0.001108319964259863, "7": 0.0010117101483047009, "8": 0.0009305924759246409, "9": 0.0008615170954726636, "10": 0.0008019877132028341, "11": 0.0007501535001210868, "12": 0.2185562402009964, "13": 0.22445981204509735, "14": 0.0006283234106376767, "15": 0.0005960555863566697, "16": 0.538081169128418, "17": 0.0005405366537161171, "18": 0.0005164829781278968, "19": 0.0004944788524881005}}, {"key": "chow2023beware", "year": "2023", "title": "Beware of the Unexpected: Bimodal Taint Analysis", "topic_distr": {"0": 0.27222803235054016, "1": 0.0011192484525963664, "2": 0.000945979030802846, "3": 0.0008193546673282981, "4": 0.0007226401357911527, "5": 0.16344819962978363, "6": 0.0005846275598742068, "7": 0.06035266071557999, "8": 0.0004908780683763325, "9": 0.4959842562675476, "10": 0.0004230403865221888, "11": 0.00039569835644215345, "12": 0.000371676025679335, "13": 0.0003504035121295601, "14": 0.0003314342175144702, "15": 0.00031441327882930636, "16": 0.000299055187497288, "17": 0.00028512757853604853, "18": 0.0002724395308177918, "19": 0.00026083257398568094}}, {"key": "ciurumelea2020suggesting", "year": "2020", "title": "Suggesting Comment Completions for Python using Neural Language Models", "topic_distr": {"0": 0.0018636283930391073, "1": 0.0015201374189928174, "2": 0.0012848502956330776, "3": 0.0011128297774121165, "4": 0.0009814712684601545, "5": 0.0008778537157922983, "6": 0.0007940260693430901, "7": 0.0007248125039041042, "8": 0.0006666979752480984, "9": 0.0006172108114697039, "10": 0.7201627492904663, "11": 0.0005374273168854415, "12": 0.0005048008169978857, "13": 0.0004759090079460293, "14": 0.10015977174043655, "15": 0.00042702798964455724, "16": 0.00040616904152557254, "17": 0.00038725294871255755, "18": 0.00037002036697231233, "19": 0.16612540185451508}}, {"key": "clement2020pymt5", "year": "2020", "title": "PyMT5: multi-mode translation of natural language and Python code with transformers", "topic_distr": {"0": 0.0019194015767425299, "1": 0.0015665441751480103, "2": 0.22276169061660767, "3": 0.0011470563476905227, "4": 0.2555106580257416, "5": 0.0009048546198755503, "6": 0.0008184482576325536, "7": 0.0007471059216186404, "8": 0.0006872038939036429, "9": 0.0006361945997923613, "10": 0.0005922346608713269, "11": 0.0005539571866393089, "12": 0.0005203271866776049, "13": 0.0004905467503704131, "14": 0.5091392993927002, "15": 0.00044016228639520705, "16": 0.0004186617734376341, "17": 0.00039916386594995856, "18": 0.0003814012452494353, "19": 0.00036515211104415357}}, {"key": "clement2021distilling", "year": "2021", "title": "Distilling Transformers for Neural Cross-Domain Search", "topic_distr": {"0": 0.349011093378067, "1": 0.226850226521492, "2": 0.0017934583593159914, "3": 0.001553339185193181, "4": 0.20125357806682587, "5": 0.001225347281433642, "6": 0.0011083370773121715, "7": 0.0010117256315425038, "8": 0.0009306067368015647, "9": 0.000861530308611691, "10": 0.0008020000532269478, "11": 0.0007501649670302868, "12": 0.0007046234677545726, "13": 0.05706406757235527, "14": 0.04537669196724892, "15": 0.0005960647249594331, "16": 0.0005669488455168903, "17": 0.0005405449192039669, "18": 0.0005164909525774419, "19": 0.10748318582773209}}, {"key": "clement2021long", "year": "2021", "title": "Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy", "topic_distr": {"0": 0.0016871478874236345, "1": 0.0013760353904217482, "2": 0.0011632733512669802, "3": 0.0010075579630210996, "4": 0.276165246963501, "5": 0.0007948109996505082, "6": 0.0007189130992628634, "7": 0.000656247022561729, "8": 0.0006036299746483564, "9": 0.0005588241619989276, "10": 0.25747987627983093, "11": 0.00048658798914402723, "12": 0.07398940622806549, "13": 0.00043088916572742164, "14": 0.21418000757694244, "15": 0.0003866321931127459, "16": 0.0003677464264910668, "17": 0.00035061975358985364, "18": 0.000335017335601151, "19": 0.1672615259885788}}, {"key": "commit2vec2019lozoya", "year": "2019", "title": "Commit2Vec: Learning Distributed Representations of Code Changes", "topic_distr": {"0": 0.3049479126930237, "1": 0.0013950756983831525, "2": 0.0011792208533734083, "3": 0.12463096529245377, "4": 0.1201421245932579, "5": 0.0008057016530074179, "6": 0.12914793193340302, "7": 0.0006652391166426241, "8": 0.0006119011086411774, "9": 0.0005664813215844333, "10": 0.0005273384158499539, "11": 0.034302689135074615, "12": 0.2784425616264343, "13": 0.00043679337250068784, "14": 0.0004131473251618445, "15": 0.00039192993426695466, "16": 0.0003727854054886848, "17": 0.00035542406840249896, "18": 0.00033960785367526114, "19": 0.0003251392627134919}}, {"key": "compton2020embedding", "year": "2020", "title": "Embedding Java Classes with code2vec: Improvements from Variable Obfuscation", "topic_distr": {"0": 0.13614457845687866, "1": 0.0010609731543809175, "2": 0.1521930694580078, "3": 0.13068121671676636, "4": 0.0006850044010207057, "5": 0.0006126853404566646, "6": 0.0005541789578273892, "7": 0.0005058723618276417, "8": 0.00046531218686141074, "9": 0.16778895258903503, "10": 0.00040100759360939264, "11": 0.00037508958484977484, "12": 0.04413089528679848, "13": 0.2473706752061844, "14": 0.00031417247373610735, "15": 0.00029803800862282515, "16": 0.11564253270626068, "17": 0.0002702775818761438, "18": 0.00025825033662840724, "19": 0.0002472478954587132}}, {"key": "corley2015exploring", "year": "2015", "title": "Exploring the Use of Deep Learning for Feature Location", "topic_distr": {"0": 0.002114138100296259, "1": 0.0017265239730477333, "2": 0.0014590485952794552, "3": 0.7595317959785461, "4": 0.0011145435273647308, "5": 0.0009968768572434783, "6": 0.0009016835247166455, "7": 0.0008230857201851904, "8": 0.0007570917950943112, "9": 0.0007008949178270996, "10": 0.0006524642230942845, "11": 0.0006102939951233566, "12": 0.00057324388762936, "13": 0.14743225276470184, "14": 0.0005111781065352261, "15": 0.0004849263059441, "16": 0.07834775000810623, "17": 0.0004397583834361285, "18": 0.00042018931708298624, "19": 0.00040228766738437116}}, {"key": "cummins2017end", "year": "2017", "title": "End-to-end Deep Learning of Optimization Heuristics", "topic_distr": {"0": 0.0015997372101992369, "1": 0.20658142864704132, "2": 0.10455380380153656, "3": 0.0009559126337990165, "4": 0.000843074347358197, "5": 0.000754066975787282, "6": 0.0006820598500780761, "7": 0.032617535442113876, "8": 0.043258897960186005, "9": 0.0005301774363033473, "10": 0.0004935431061312556, "11": 0.0004616442893166095, "12": 0.00043361849384382367, "13": 0.0004088007553946227, "14": 0.000386670115403831, "15": 0.00036681248457171023, "16": 0.0003488948568701744, "17": 0.4412587285041809, "18": 0.1631603240966797, "19": 0.0003043022006750107}}, {"key": "cummins2017synthesizing", "year": "2017", "title": "Synthesizing benchmarks for predictive modeling", "topic_distr": {"0": 0.001313594519160688, "1": 0.0010719354031607509, "2": 0.000906136236153543, "3": 0.0007848312961868942, "4": 0.0006921899621374905, "5": 0.0006191125721670687, "6": 0.0005599924479611218, "7": 0.0005111790960654616, "8": 0.48041224479675293, "9": 0.0004352922551333904, "10": 0.0004052142903674394, "11": 0.0003790243936236948, "12": 0.000356014323187992, "13": 0.00033563817851245403, "14": 0.00031746822060085833, "15": 0.00030116448760963976, "16": 0.0002864535781554878, "17": 0.42306429147720337, "18": 0.08699838072061539, "19": 0.0002498415997251868}}, {"key": "cummins2018compiler", "year": "2018", "title": "Compiler Fuzzing through Deep Learning", "topic_distr": {"0": 0.0016631261678412557, "1": 0.0013577856589108706, "2": 0.0011478213127702475, "3": 0.000994123867712915, "4": 0.0008767833933234215, "5": 0.9874476194381714, "6": 0.000709329207893461, "7": 0.0006474985275417566, "8": 0.0005955828819423914, "9": 0.0005513743963092566, "10": 0.0005132753285579383, "11": 0.0004801012109965086, "12": 0.0004509549180511385, "13": 0.0004251449427101761, "14": 0.0004021294880658388, "15": 0.0003814779338426888, "16": 0.0003628439735621214, "17": 0.00034594559110701084, "18": 0.00033055117819458246, "19": 0.0003164684458170086}}, {"key": "cummins2020programl", "year": "2020", "title": "ProGraML: Graph-based Deep Learning for Program Optimization and Analysis", "topic_distr": {"0": 0.23154021799564362, "1": 0.0010832620318979025, "2": 0.0009157946915365756, "3": 0.0007931998698040843, "4": 0.017936376854777336, "5": 0.0006257071509025991, "6": 0.0005659573362208903, "7": 0.0005166240152902901, "8": 0.0004752017557621002, "9": 0.0821821540594101, "10": 0.00040953047573566437, "11": 0.022235609591007233, "12": 0.39384961128234863, "13": 0.00033921326394192874, "14": 0.00032084976555779576, "15": 0.00030437237001024187, "16": 0.0002895047655329108, "17": 0.24510060250759125, "18": 0.00026373908622190356, "19": 0.0002525028248783201}}, {"key": "cvitkovic2018open", "year": "2018", "title": "Open Vocabulary Learning on Source Code with a Graph-Structured Cache", "topic_distr": {"0": 0.27201002836227417, "1": 0.0021670174319297075, "2": 0.001831602887250483, "3": 0.0015863876324146986, "4": 0.0013991338200867176, "5": 0.03441820293664932, "6": 0.19088785350322723, "7": 0.0010332530364394188, "8": 0.0009504081099294126, "9": 0.0008798618800938129, "10": 0.040652260184288025, "11": 0.0007661269046366215, "12": 0.17528486251831055, "13": 0.27271905541419983, "14": 0.0006417026743292809, "15": 0.0006087477086111903, "16": 0.0005790123250335455, "17": 0.0005520465783774853, "18": 0.0005274807917885482, "19": 0.0005050080944783986}}, {"key": "dam2016deep", "year": "2016", "title": "A deep language model for software code", "topic_distr": {"0": 0.003121335990726948, "1": 0.002546119038015604, "2": 0.846623420715332, "3": 0.0018639969639480114, "4": 0.0016439742175862193, "5": 0.001470412127673626, "6": 0.0013299999991431832, "7": 0.0012140667531639338, "8": 0.0011167244520038366, "9": 0.0010338330175727606, "10": 0.13148176670074463, "11": 0.0009001950384117663, "12": 0.0008455453789792955, "13": 0.0007971514132805169, "14": 0.0007539971848018467, "15": 0.0007152752950787544, "16": 0.0006803363794460893, "17": 0.0006486517959274352, "18": 0.000619787082541734, "19": 0.0005933818174526095}}, {"key": "dash2018refinym", "year": "2018", "title": "RefiNym: Using Names to Refine Types", "topic_distr": {"0": 0.0031983046792447567, "1": 0.002610862720757723, "2": 0.0022071583662182093, "3": 0.0019117393530905247, "4": 0.0016860822215676308, "5": 0.0015080762095749378, "6": 0.001364067429676652, "7": 0.001245164661668241, "8": 0.0011453289771452546, "9": 0.8495875000953674, "10": 0.0009870483772829175, "11": 0.0009232531883753836, "12": 0.0008672037511132658, "13": 0.12664401531219482, "14": 0.0007733105449005961, "15": 0.0007335968548431993, "16": 0.0006977629382163286, "17": 0.000665266765281558, "18": 0.000635662698186934, "19": 0.0006085810600779951}}, {"key": "david2019neural", "year": "2019", "title": "Neural Reverse Engineering of Stripped Binaries", "topic_distr": {"0": 0.0022289371117949486, "1": 0.0018185402732342482, "2": 0.0015371956396847963, "3": 0.0013314266689121723, "4": 0.0011742666829377413, "5": 0.00105029356200248, "6": 0.0009499990846961737, "7": 0.0008671897230669856, "8": 0.0007976595661602914, "9": 0.4666330814361572, "10": 0.0006874257232993841, "11": 0.0006429958739317954, "12": 0.1393643170595169, "13": 0.2517140507698059, "14": 0.0005385689437389374, "15": 0.000510910467710346, "16": 0.0004859541077166796, "17": 0.12680061161518097, "18": 0.000442704651504755, "19": 0.00042384376865811646}}, {"key": "defreez2018path", "year": "2018", "title": "Path-Based Function Embedding and its Application to Specification Mining", "topic_distr": {"0": 0.0021519025322049856, "1": 0.09852743148803711, "2": 0.001484316773712635, "3": 0.11290596425533295, "4": 0.0011337955947965384, "5": 0.11980891972780228, "6": 0.04849536716938019, "7": 0.44312751293182373, "8": 0.0007701672147959471, "9": 0.0007129997829906642, "10": 0.0006637327023781836, "11": 0.0006208341801539063, "12": 0.0005831441376358271, "13": 0.000549768446944654, "14": 0.0005200064624659717, "15": 0.0004933012533001602, "16": 0.0004692050570156425, "17": 0.0004473532608244568, "18": 0.16612501442432404, "19": 0.000409235421102494}}, {"key": "derezendemartins2020concra.md", "year": "2020", "title": "CoNCRA: A Convolutional Neural Network Code Retrieval Approach", "topic_distr": {"0": 0.0024452742654830217, "1": 0.6089854836463928, "2": 0.0016879364848136902, "3": 0.0014619462890550494, "4": 0.0012893732637166977, "5": 0.0011532497592270374, "6": 0.0010431240079924464, "7": 0.0009521971805952489, "8": 0.0008758511976338923, "9": 0.0008108391775749624, "10": 0.0007548115099780262, "11": 0.000706026388797909, "12": 0.0006631644791923463, "13": 0.0006252088933251798, "14": 0.0005913628847338259, "15": 0.0005609931540675461, "16": 0.0005335904425010085, "17": 0.0005087400786578655, "18": 0.0004861013439949602, "19": 0.3738647699356079}}, {"key": "devanbu2020deep", "year": "2020", "title": "Deep Learning & Software Engineering: State of Research and Future Directions", "topic_distr": {"0": 0.9734703302383423, "1": 0.0030853571370244026, "2": 0.0026086049620062113, "3": 0.002259355504065752, "4": 0.0019926626700907946, "5": 0.0017822925001382828, "6": 0.0016120979562401772, "7": 0.0014715747674927115, "8": 0.001353585859760642, "9": 0.0012531128013506532, "10": 0.0011665248312056065, "11": 0.0010911297285929322, "12": 0.0010248887119814754, "13": 0.0009662301745265722, "14": 0.0009139227913692594, "15": 0.0008669878588989377, "16": 0.0008246382349170744, "17": 0.000786233227699995, "18": 0.0007512462325394154, "19": 0.000719240284524858}}, {"key": "devlin2017semantic", "year": "2017", "title": "Semantic Code Repair using Neuro-Symbolic Transformation Networks", "topic_distr": {"0": 0.001578795607201755, "1": 0.0012890896759927273, "2": 0.04834555834531784, "3": 0.06717519462108612, "4": 0.0008323897491209209, "5": 0.0007445113733410835, "6": 0.0006734167691320181, "7": 0.0006147164385765791, "8": 0.0005654292763210833, "9": 0.0005234589916653931, "10": 0.0004872888675890863, "11": 0.0004557943029794842, "12": 0.12953884899616241, "13": 0.0004036203899886459, "14": 0.2175041288137436, "15": 0.5279805660247803, "16": 0.0003444736357778311, "17": 0.00032843078952282667, "18": 0.0003138157771900296, "19": 0.0003004460595548153}}, {"key": "deze2021mulcode", "year": "2021", "title": "MulCode: A Multi-task Learning Approach for Source Code Understanding", "topic_distr": {"0": 0.07758750766515732, "1": 0.0016696001403033733, "2": 0.0014112319331616163, "3": 0.18955039978027344, "4": 0.2682070732116699, "5": 0.0009642158402130008, "6": 0.0008721413323655725, "7": 0.12440717220306396, "8": 0.0007322868332266808, "9": 0.0006779311806894839, "10": 0.0006310872850008309, "11": 0.0005902987322770059, "12": 0.2807376980781555, "13": 0.0005227283108979464, "14": 0.000494430132675916, "15": 0.0004690384666901082, "16": 0.00044612743658944964, "17": 0.0004253503866493702, "18": 0.0004064224776811898, "19": 0.04919726774096489}}, {"key": "deze2022bridging", "year": "2022", "title": "Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding", "topic_distr": {"0": 0.0022272628266364336, "1": 0.07991957664489746, "2": 0.0015371942427009344, "3": 0.11622964590787888, "4": 0.48788267374038696, "5": 0.0010502820368856192, "6": 0.0009499887237325311, "7": 0.0008671802352182567, "8": 0.3032287657260895, "9": 0.0007384433993138373, "10": 0.0006874181563034654, "11": 0.0006429888308048248, "12": 0.000603953842073679, "13": 0.0005693871062248945, "14": 0.0005385630065575242, "15": 0.0005109048797748983, "16": 0.000485948781715706, "17": 0.00046331717749126256, "18": 0.00044269979116506875, "19": 0.0004238391120452434}}, {"key": "dinella2020hoppity", "year": "2020", "title": "Hoppity: Learning Bug Detection and Repair", "topic_distr": {"0": 0.002043989486992359, "1": 0.0016694639343768358, "2": 0.0014112211065366864, "3": 0.0012223131489008665, "4": 0.0010780205484479666, "5": 0.18820810317993164, "6": 0.0008721345802769065, "7": 0.0007961125229485333, "8": 0.0007322812452912331, "9": 0.139256551861763, "10": 0.0006310823955573142, "11": 0.03634815663099289, "12": 0.15797965228557587, "13": 0.0005227242945693433, "14": 0.0004944263491779566, "15": 0.3412266969680786, "16": 0.00044612400233745575, "17": 0.12426544725894928, "18": 0.0004064193635713309, "19": 0.0003891043597832322}}, {"key": "dinella2021deepmerge", "year": "2021", "title": "DeepMerge: Learning to Merge Programs", "topic_distr": {"0": 0.0021900932770222425, "1": 0.11105068773031235, "2": 0.11097457259893417, "3": 0.001308081904426217, "4": 0.001153677818365395, "5": 0.46719083189964294, "6": 0.0009333440102636814, "7": 0.0008519864059053361, "8": 0.11317036300897598, "9": 0.0007255051168613136, "10": 0.0006753739435225725, "11": 0.13209104537963867, "12": 0.0005933719803579152, "13": 0.0005594108952209353, "14": 0.0005291269044391811, "15": 0.05421854555606842, "16": 0.00047743451432324946, "17": 0.00045519942068494856, "18": 0.00043494327110238373, "19": 0.0004164130368735641}}, {"key": "dinella2022toga", "year": "2022", "title": "TOGA: A Neural Method for Test Oracle Generation", "topic_distr": {"0": 0.0013852992560714483, "1": 0.0011317377211526036, "2": 0.7136439085006714, "3": 0.0008284311043098569, "4": 0.0007306470070034266, "5": 0.2768527567386627, "6": 0.0005911043263040483, "7": 0.0005395790212787688, "8": 0.000496316293720156, "9": 0.0004594760830514133, "10": 0.0004277270345482975, "11": 0.0004000820918008685, "12": 0.0003757936356123537, "13": 0.0003542854683473706, "14": 0.00033510601497255266, "15": 0.00031789648346602917, "16": 0.0003023682511411607, "17": 0.0002882863627746701, "18": 0.00027545777265913785, "19": 0.0002637222351040691}}, {"key": "ding2019asm2vec", "year": "2019", "title": "Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization", "topic_distr": {"0": 0.0011553327785804868, "1": 0.0009430405334569514, "2": 0.000797071261331439, "3": 0.3766096234321594, "4": 0.0006088808877393603, "5": 0.1688503921031952, "6": 0.0004925943212583661, "7": 0.0004496559267863631, "8": 0.06764073669910431, "9": 0.0003829024499282241, "10": 0.0003564445360098034, "11": 0.00033340672962367535, "12": 0.0003131660632789135, "13": 0.0002952422946691513, "14": 0.0002792591985780746, "15": 0.0002649177040439099, "16": 0.0002519773261155933, "17": 0.3172796070575714, "18": 0.00022955157328397036, "19": 0.062466204166412354}}, {"key": "ding2021contrastive", "year": "2021", "title": "Contrastive Learning for Source Code with Structural and Functional Properties", "topic_distr": {"0": 0.0019201850518584251, "1": 0.0015666830586269498, "2": 0.001324363169260323, "3": 0.0011470681056380272, "4": 0.3462819755077362, "5": 0.0009048579377122223, "6": 0.0008184515754692256, "7": 0.000747108890209347, "8": 0.539134681224823, "9": 0.0006361971609294415, "10": 0.0005922369891777635, "11": 0.0005539593985304236, "12": 0.10141314566135406, "13": 0.0004905487294308841, "14": 0.000463992590084672, "15": 0.0004401640617288649, "16": 0.00041866343235597014, "17": 0.0003991654375568032, "18": 0.00038140275864861906, "19": 0.0003651535662356764}}, {"key": "ding2023static", "year": "2023", "title": "A Static Evaluation of Code Completion by Large Language Models", "topic_distr": {"0": 0.29353588819503784, "1": 0.0013055771123617887, "2": 0.0011036749929189682, "3": 0.0009558993624523282, "4": 0.0008430638699792325, "5": 0.0007540570804849267, "6": 0.0006820508278906345, "7": 0.13617339730262756, "8": 0.000572678807657212, "9": 0.4085389971733093, "10": 0.15217356383800507, "11": 0.00046163814840838313, "12": 0.00043361273128539324, "13": 0.0004087953129783273, "14": 0.00038666496402584016, "15": 0.0003668075951281935, "16": 0.0003488902293611318, "17": 0.00033264170633628964, "18": 0.0003178392944391817, "19": 0.0003042981552425772}}, {"key": "doderlein2022piloting", "year": "2022", "title": "Piloting Copilot and Codex: Hot Temperature, Cold Prompts, or Black Magic?", "topic_distr": {"0": 0.34882959723472595, "1": 0.15312977135181427, "2": 0.001117955194786191, "3": 0.000968306150753051, "4": 0.19966815412044525, "5": 0.0007638490642420948, "6": 0.0006909078801982105, "7": 0.000630682916380465, "8": 0.0005801155348308384, "9": 0.0005370551371015608, "10": 0.0004999455413781106, "11": 0.00046763295540586114, "12": 0.00043924356577917933, "13": 0.00041410388075746596, "14": 0.00039168616058304906, "15": 0.00037157093174755573, "16": 0.00035342088085599244, "17": 0.1323406994342804, "18": 0.053461067378520966, "19": 0.1043441966176033}}, {"key": "dong2023codescore", "year": "2023", "title": "CodeScore: Evaluating Code Generation by Learning Code Execution", "topic_distr": {"0": 0.16561131179332733, "1": 0.0014143302105367184, "2": 0.47642552852630615, "3": 0.001035547349601984, "4": 0.0009133120765909553, "5": 0.0008168890490196645, "6": 0.0007388829835690558, "7": 0.0006744762067683041, "8": 0.0006203975644893944, "9": 0.0005743471556343138, "10": 0.0005346607067622244, "11": 0.0005001043900847435, "12": 0.0004697437398135662, "13": 0.00044285840704105794, "14": 0.00041888401028700173, "15": 0.3473964035511017, "16": 0.00037796166725456715, "17": 0.0003603592631407082, "18": 0.0003443234309088439, "19": 0.0003296539653092623}}, {"key": "drain2021deepdebug", "year": "2021", "title": "DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons", "topic_distr": {"0": 0.27707719802856445, "1": 0.0014547912869602442, "2": 0.18918542563915253, "3": 0.0010651404736563563, "4": 0.000939417106565088, "5": 0.0008402387029491365, "6": 0.040583349764347076, "7": 0.00069375493330881, "8": 0.0006381305865943432, "9": 0.0005907638696953654, "10": 0.0005499430699273944, "11": 0.0005143990274518728, "12": 0.00048317055916413665, "13": 0.00045551673974841833, "14": 0.0004308570933062583, "15": 0.31482261419296265, "16": 0.16861136257648468, "17": 0.00037065951619297266, "18": 0.00035416532773524523, "19": 0.000339076534146443}}, {"key": "drain2021generating", "year": "2021", "title": "Generating Bug-Fixes Using Pretrained Transformers", "topic_distr": {"0": 0.0013412319822236896, "1": 0.0010948957642540336, "2": 0.1844622790813446, "3": 0.03930656239390373, "4": 0.07653806358575821, "5": 0.0006324322894215584, "6": 0.0005720401532016695, "7": 0.0005221766186878085, "8": 0.07158108800649643, "9": 0.032628048211336136, "10": 0.00041393208084627986, "11": 0.0003871787339448929, "12": 0.0003636736364569515, "13": 0.0003428591007832438, "14": 0.030176999047398567, "15": 0.40247341990470886, "16": 0.156362384557724, "17": 0.0002789886202663183, "18": 0.0002665737411007285, "19": 0.0002552166988607496}}, {"key": "edelmann2019neural", "year": "2019", "title": "Neural-Network Guided Expression Transformation", "topic_distr": {"0": 0.0024449792690575123, "1": 0.7410998940467834, "2": 0.0016878575552254915, "3": 0.0014619515277445316, "4": 0.0012893748935312033, "5": 0.001153249992057681, "6": 0.0010431240079924464, "7": 0.0009521972388029099, "8": 0.0008758513140492141, "9": 0.0008108392357826233, "10": 0.0007548115681856871, "11": 0.0007060264470055699, "12": 0.0006631645374000072, "13": 0.0006252089515328407, "14": 0.0005913628847338259, "15": 0.000560993212275207, "16": 0.0005335904425010085, "17": 0.24179399013519287, "18": 0.0004861014022026211, "19": 0.0004653916403185576}}, {"key": "ederhardt2019unsupervised", "year": "2019", "title": "Unsupervised Learning of API Aliasing Specifications", "topic_distr": {"0": 0.0015213475562632084, "1": 0.001241855090484023, "2": 0.001049787737429142, "3": 0.0009092605323530734, "4": 0.0008019335800781846, "5": 0.0007172708283178508, "6": 0.0006487772916443646, "7": 0.0005922247655689716, "8": 0.000544740934856236, "9": 0.0005043062847107649, "10": 0.0004694595991168171, "11": 0.00043911737157031894, "12": 0.00041245913598686457, "13": 0.9881907105445862, "14": 0.00036780169466510415, "15": 0.0003489130758680403, "16": 0.0003318697854410857, "17": 0.00031641393434256315, "18": 0.0003023336757905781, "19": 0.0002894531062338501}}, {"key": "efstathiou2019semantic", "year": "2019", "title": "Semantic Source Code Models Using Identifier Embeddings", "topic_distr": {"0": 0.5286051630973816, "1": 0.001497821998782456, "2": 0.05641859397292137, "3": 0.2034681886434555, "4": 0.0009670521249063313, "5": 0.0008649560622870922, "6": 0.1415601670742035, "7": 0.06093140318989754, "8": 0.0006569027318619192, "9": 0.0006081425817683339, "10": 0.000566120957955718, "11": 0.0005295313312672079, "12": 0.0004973841714672744, "13": 0.00046891687088645995, "14": 0.00044353181147016585, "15": 0.00042075401870533824, "16": 0.00040020153392106295, "17": 0.0003815633535850793, "18": 0.0003645839460659772, "19": 0.00034905128995887935}}, {"key": "eghbali2022crystalbleu", "year": "2022", "title": "CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code", "topic_distr": {"0": 0.061734769493341446, "1": 0.001144320354796946, "2": 0.0009672220912761986, "3": 0.0008377557387575507, "4": 0.0007388639496639371, "5": 0.000660858815535903, "6": 0.0005977523396722972, "7": 0.0005456475773826241, "8": 0.0005018982337787747, "9": 0.00046464367187581956, "10": 0.03414887562394142, "11": 0.0004045817186124623, "12": 0.00038002009387128055, "13": 0.0003582700155675411, "14": 0.02986810728907585, "15": 0.7293069362640381, "16": 0.00030576891731470823, "17": 0.0002915286459028721, "18": 0.0002785557589959353, "19": 0.13646367192268372}}, {"key": "ellis2021dreamcoder", "year": "2021", "title": "DreamCoder: bootstrapping inductive program synthesis with wake-sleep library learning", "topic_distr": {"0": 0.0023523876443505287, "1": 0.32100391387939453, "2": 0.0016241332050412893, "3": 0.001406748779118061, "4": 0.0012407016474753618, "5": 0.0011097159003838897, "6": 0.2805887758731842, "7": 0.0009162528440356255, "8": 0.0008427888969890773, "9": 0.0007802309701219201, "10": 0.0007263182778842747, "11": 0.0006793747306801379, "12": 0.0006381308194249868, "13": 0.0006016080151312053, "14": 0.0005690396064892411, "15": 0.000539816333912313, "16": 0.0005134480306878686, "17": 0.000489535741508007, "18": 0.3829292953014374, "19": 0.00044782363693229854}}, {"key": "elnaggar2021codetrans", "year": "2021", "title": "CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing", "topic_distr": {"0": 0.7645838260650635, "1": 0.0017556650564074516, "2": 0.0014842059463262558, "3": 0.001285500475205481, "4": 0.22145520150661469, "5": 0.0010140646481886506, "6": 0.0009172299178317189, "7": 0.0008372769225388765, "8": 0.0007701451540924609, "9": 0.0007129793521016836, "10": 0.0006637136684730649, "11": 0.0006208163686096668, "12": 0.0005831274902448058, "13": 0.0005497527308762074, "14": 0.0005199915613047779, "15": 0.0004932871670462191, "16": 0.00046919164014980197, "17": 0.00044734045513905585, "18": 0.00042743401718325913, "19": 0.0004092237213626504}}, {"key": "eniser2023automatically", "year": "2023", "title": "Automatically Testing Functional Properties of Code Translation Models", "topic_distr": {"0": 0.001524079474620521, "1": 0.0012420096900314093, "2": 0.0010498319752514362, "3": 0.0009092854452319443, "4": 0.0008019437664188445, "5": 0.0007172786281444132, "6": 0.0006487845093943179, "7": 0.0005922313430346549, "8": 0.000544746988452971, "9": 0.0005043118726462126, "10": 0.0004694647795986384, "11": 0.0004391222319100052, "12": 0.0004124637052882463, "13": 0.0003888567443937063, "14": 0.38512372970581055, "15": 0.00034891694667749107, "16": 0.00033187345252372324, "17": 0.6033592820167542, "18": 0.00030233702273108065, "19": 0.00028945630765520036}}, {"key": "feng2020codebert", "year": "2020", "title": "CodeBERT: A Pre-Trained Model for Programming and Natural Languages", "topic_distr": {"0": 0.2719588875770569, "1": 0.09400292485952377, "2": 0.001434714300557971, "3": 0.06477662175893784, "4": 0.28674253821372986, "5": 0.0009802681161090732, "6": 0.0008866605930961668, "7": 0.000809372344519943, "8": 0.06864766776561737, "9": 0.024669654667377472, "10": 0.000641593593172729, "11": 0.0006001259316690266, "12": 0.0005636931164190173, "13": 0.0005314306472428143, "14": 0.0005026613944210112, "15": 0.0004768469661939889, "16": 0.00045355450129136443, "17": 0.00043243158143013716, "18": 0.0004131885652896017, "19": 0.18047519028186798}}, {"key": "fernandes2019structured", "year": "2019", "title": "Structured Neural Summarization", "topic_distr": {"0": 0.0027128641959279776, "1": 0.09432504326105118, "2": 0.00187137839384377, "3": 0.0016208878951147199, "4": 0.0014295618748292327, "5": 0.0012786349980160594, "6": 0.15184292197227478, "7": 0.00105572328902781, "8": 0.0009710767189972103, "9": 0.0008989963098429143, "10": 0.0008368772105313838, "11": 0.0007827879744581878, "12": 0.3043804466724396, "13": 0.0006931836833246052, "14": 0.0006556578446179628, "15": 0.0006219862261787057, "16": 0.0005916041554883122, "17": 0.0005640519666485488, "18": 0.000538951950147748, "19": 0.43232738971710205}}, {"key": "fowkes2016parameter", "year": "2016", "title": "Parameter-Free Probabilistic API Mining across GitHub", "topic_distr": {"0": 0.002079138532280922, "1": 0.0016972250305116177, "2": 0.0014346728567034006, "3": 0.0012426358880475163, "4": 0.0010959607316181064, "5": 0.0009802558925002813, "6": 0.000886649708263576, "7": 0.0008093625074252486, "8": 0.000744468707125634, "9": 0.0006892087985761464, "10": 0.0006415856769308448, "11": 0.0006001185392960906, "12": 0.0005636861314997077, "13": 0.9838607907295227, "14": 0.0005026551662012935, "15": 0.00047684108722023666, "16": 0.00045354891335591674, "17": 0.0004324262554291636, "18": 0.00041318347211927176, "19": 0.00039558031130582094}}, {"key": "fowkes2017autofolding", "year": "2017", "title": "Autofolding for Source Code Summarization", "topic_distr": {"0": 0.002712604124099016, "1": 0.002213807078078389, "2": 0.0018715002806857228, "3": 0.0016208700835704803, "4": 0.0014295452274382114, "5": 0.0012786219595000148, "6": 0.00115652394015342, "7": 0.07320626825094223, "8": 0.0009710665908642113, "9": 0.0008989869384095073, "10": 0.35562118887901306, "11": 0.000782779767177999, "12": 0.0007352582761086524, "13": 0.000693176465574652, "14": 0.000655650976113975, "15": 0.0006219797069206834, "16": 0.0005915979854762554, "17": 0.0005640460876747966, "18": 0.1237853541970253, "19": 0.4285891354084015}}, {"key": "franks2015cacheca", "year": "2015", "title": "CACHECA: A Cache Language Model Based Code Suggestion Tool", "topic_distr": {"0": 0.14414134621620178, "1": 0.0027520416770130396, "2": 0.002326542278751731, "3": 0.002015121281147003, "4": 0.001777259400114417, "5": 0.0015896281693130732, "6": 0.0014378316700458527, "7": 0.0013124990509822965, "8": 0.0012072644894942641, "9": 0.1613740473985672, "10": 0.5391808152198792, "11": 0.000973179645370692, "12": 0.0009140992187894881, "13": 0.1346617043018341, "14": 0.0008151286165229976, "15": 0.0007732672966085374, "16": 0.0007354956469498575, "17": 0.0007012421847321093, "18": 0.0006700372323393822, "19": 0.0006414910894818604}}, {"key": "fried2022incoder", "year": "2022", "title": "InCoder: A Generative Model for Code Infilling and Synthesis", "topic_distr": {"0": 0.17570388317108154, "1": 0.0018518088618293405, "2": 0.0015651634894311428, "3": 0.0013556479243561625, "4": 0.1602182686328888, "5": 0.0010694044176489115, "6": 0.0009672852465882897, "7": 0.0008829690632410347, "8": 0.0008121737046167254, "9": 0.3773057460784912, "10": 0.000699934083968401, "11": 0.000654695788398385, "12": 0.0006149500841274858, "13": 0.0005797540070489049, "14": 0.0005483686691150069, "15": 0.0005202069296501577, "16": 0.0004947964916937053, "17": 0.00047175283543765545, "18": 0.19948026537895203, "19": 0.07420288026332855}}, {"key": "fu2019coda", "year": "2019", "title": "Coda: An End-to-End Neural Program Decompiler", "topic_distr": {"0": 0.1698305755853653, "1": 0.0011842962121590972, "2": 0.0010009787511080503, "3": 0.0008669791277498007, "4": 0.0852847620844841, "5": 0.12561899423599243, "6": 0.0006186059326864779, "7": 0.04753847420215607, "8": 0.0005194077966734767, "9": 0.0004808535741176456, "10": 0.0004476273898035288, "11": 0.0004186962323728949, "12": 0.32453081011772156, "13": 0.00037076888838782907, "14": 0.00035069711157120764, "15": 0.054826099425554276, "16": 0.00031643619877286255, "17": 0.00030169912497512996, "18": 0.1852172464132309, "19": 0.0002759921189863235}}, {"key": "gao2019neural", "year": "2019", "title": "A Neural Model for Method Name Generation from Functional Description", "topic_distr": {"0": 0.0011665665078908205, "1": 0.10856654495000839, "2": 0.1397399604320526, "3": 0.0006968271336518228, "4": 0.0006145758088678122, "5": 0.0005496916710399091, "6": 0.0004972007591277361, "7": 0.00045386081910692155, "8": 0.09876526892185211, "9": 0.00038648309418931603, "10": 0.00035977776860818267, "11": 0.00033652453566901386, "12": 0.0003160945780109614, "13": 0.40597328543663025, "14": 0.00028187065618112683, "15": 0.00026739505119621754, "16": 0.00025433365954086185, "17": 0.00024248882255051285, "18": 0.00023169818450696766, "19": 0.2402995526790619}}, {"key": "garg2022deepperf", "year": "2022", "title": "DeepPERF: A Deep Learning-Based Approach For Improving Software Performance", "topic_distr": {"0": 0.5657629370689392, "1": 0.0014342962531372905, "2": 0.1432536244392395, "3": 0.0010501279029995203, "4": 0.0009261739905923605, "5": 0.0008283941424451768, "6": 0.0007492894656024873, "7": 0.0006839755224063993, "8": 0.0006291352328844368, "9": 0.0005824362160637975, "10": 0.23600709438323975, "11": 0.0005071478663012385, "12": 0.00047635959344916046, "13": 0.00044909559073857963, "14": 0.00042478356044739485, "15": 0.00040296860970556736, "16": 0.00038328487426042557, "17": 0.04476544260978699, "18": 0.00034917285665869713, "19": 0.0003342967829667032}}, {"key": "gharibi2024t5apr", "year": "2024", "title": "T5APR: Empowering Automated Program Repair across Languages through Checkpoint Ensemble", "topic_distr": {"0": 0.38257521390914917, "1": 0.0012729562586173415, "2": 0.001076052081771195, "3": 0.0009319993550889194, "4": 0.0008219831506721675, "5": 0.24230006337165833, "6": 0.0006649969145655632, "7": 0.0006070305244065821, "8": 0.0005583596066571772, "9": 0.0005169140640646219, "10": 0.0004811961844097823, "11": 0.00045009542373009026, "12": 0.00042277073953300714, "13": 0.0003985738439951092, "14": 0.059967365115880966, "15": 0.30568334460258484, "16": 0.00034016661811619997, "17": 0.00032432438456453383, "18": 0.00030989208607934415, "19": 0.00029668951174244285}}, {"key": "gholamian2021naturalness", "year": "2021", "title": "On the Naturalness and Localness of Software Logs", "topic_distr": {"0": 0.2641570568084717, "1": 0.0017259421292692423, "2": 0.0014590523205697536, "3": 0.0012637253385037184, "4": 0.0011145543539896607, "5": 0.1797180324792862, "6": 0.000901691906619817, "7": 0.0008230933453887701, "8": 0.0007570987800136209, "9": 0.000700901378877461, "10": 0.0006524702766910195, "11": 0.0006102996412664652, "12": 0.0005732491845265031, "13": 0.0005404397961683571, "14": 0.0005111828213557601, "15": 0.00048493078793399036, "16": 0.40584704279899597, "17": 0.000439762428868562, "18": 0.1373172253370285, "19": 0.0004022913926746696}}, {"key": "glassman2015overcode", "year": "2015", "title": "OverCode: visualizing variation in student solutions to programming problems at scale", "topic_distr": {"0": 0.002654556417837739, "1": 0.002166884485632181, "2": 0.0018316243076696992, "3": 0.001586390077136457, "4": 0.5802422761917114, "5": 0.0012514186091721058, "6": 0.001131918397732079, "7": 0.001033251523040235, "8": 0.1668996959924698, "9": 0.1797463446855545, "10": 0.0008190636872313917, "11": 0.0007661257986910641, "12": 0.0007196153164841235, "13": 0.000678428856190294, "14": 0.0006417017430067062, "15": 0.05566718429327011, "16": 0.0005790115101262927, "17": 0.0005520457634702325, "18": 0.0005274799768812954, "19": 0.0005050073377788067}}, {"key": "goens2019case", "year": "2019", "title": "A case study on machine learning for synthesizing benchmarks", "topic_distr": {"0": 0.7370573282241821, "1": 0.0023681726306676865, "2": 0.2445843368768692, "3": 0.0017339111072942615, "4": 0.0015292485477402806, "5": 0.0013677992392331362, "6": 0.0012371859047561884, "7": 0.0011293430579826236, "8": 0.00103879370726645, "9": 0.0009616868919692934, "10": 0.0008952359785325825, "11": 0.0008373748860321939, "12": 0.0007865389925427735, "13": 0.0007415221771225333, "14": 0.0007013794383965433, "15": 0.0006653597811236978, "16": 0.0006328590679913759, "17": 0.0006033855606801808, "18": 0.0005765351816080511, "19": 0.0005519725964404643}}, {"key": "gros2020code", "year": "2020", "title": "Code to Comment \"Translation\": Data, Metrics, Baselining & Evaluation", "topic_distr": {"0": 0.18362784385681152, "1": 0.0010833251290023327, "2": 0.0009157847380265594, "3": 0.0007931889849714935, "4": 0.0006995575386099517, "5": 0.0006257022614590824, "6": 0.0005659529706463218, "7": 0.000516620057169348, "8": 0.00047519811778329313, "9": 0.00043992543942295015, "10": 0.0290953628718853, "11": 0.00038305867929011583, "12": 0.0003598037001211196, "13": 0.0003392106737010181, "14": 0.2572481334209442, "15": 0.0003043700708076358, "16": 0.0002895025536417961, "17": 0.00027601985493674874, "18": 0.00026373707805760205, "19": 0.5216977000236511}}, {"key": "gu2016deep", "year": "2016", "title": "Deep API Learning", "topic_distr": {"0": 0.0016401044558733702, "1": 0.12190152704715729, "2": 0.001132657052949071, "3": 0.0009810264455154538, "4": 0.0008652278338558972, "5": 0.0007738821441307664, "6": 0.0006999829201959074, "7": 0.8669176697731018, "8": 0.0005877353833056986, "9": 0.000544109323527664, "10": 0.0005065122968517244, "11": 0.00047377528971992433, "12": 0.00044501302181743085, "13": 0.0004195431247353554, "14": 0.00039683093200437725, "15": 0.0003764514985959977, "16": 0.00035806302912533283, "17": 0.0003413873491808772, "18": 0.0003261957608629018, "19": 0.0003122985945083201}}, {"key": "gu2017deepam", "year": "2017", "title": "DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning", "topic_distr": {"0": 0.0017091595800593495, "1": 0.0013951066648587584, "2": 0.001179217011667788, "3": 0.0010213693603873253, "4": 0.0009008048800751567, "5": 0.000805702933575958, "6": 0.0007287650951184332, "7": 0.8814198970794678, "8": 0.0006119020981714129, "9": 0.0005664822529070079, "10": 0.0005273392889648676, "11": 0.0004932562005706131, "12": 0.00046331126941367984, "13": 0.00043679410009644926, "14": 0.000413147994549945, "15": 0.00039193060365505517, "16": 0.0003727860457729548, "17": 0.0003554246504791081, "18": 0.10588247328996658, "19": 0.0003251398156862706}}, {"key": "gu2018deep", "year": "2018", "title": "Deep Code Search", "topic_distr": {"0": 0.0014011729508638382, "1": 0.4585311710834503, "2": 0.0009672150481492281, "3": 0.00083774549420923, "4": 0.0007388529484160244, "5": 0.0006608489202335477, "6": 0.0005977433756925166, "7": 0.3965390622615814, "8": 0.0005018907249905169, "9": 0.0004646367160603404, "10": 0.00043253108742646873, "11": 0.0004045756359118968, "12": 0.00038001438952051103, "13": 0.0003582646313589066, "14": 0.0003388697805348784, "15": 0.0003214669704902917, "16": 0.0003057643480133265, "17": 0.0002915242803283036, "18": 0.00027855159714818, "19": 0.13564808666706085}}, {"key": "gui2022cross", "year": "2022", "title": "Cross-Language Binary-Source Code Matching with Intermediate Representations", "topic_distr": {"0": 0.39615869522094727, "1": 0.2650891840457916, "2": 0.0012476143892854452, "3": 0.24566935002803802, "4": 0.0009530345560051501, "5": 0.0008524178992956877, "6": 0.000771018851082772, "7": 0.08365446329116821, "8": 0.0006473801331594586, "9": 0.0005993268569000065, "10": 0.0005579143762588501, "11": 0.0005218551377765834, "12": 0.0004901740467175841, "13": 0.0004621193802449852, "14": 0.00043710230966098607, "15": 0.0004146546998526901, "16": 0.0003944001509808004, "17": 0.0003760321415029466, "18": 0.00035929889418184757, "19": 0.000343991385307163}}, {"key": "gulwani2014nlyze", "year": "2014", "title": "NLyze: Interactive Programming by Natural Language for SpreadSheet Data Analysis and Manipulation", "topic_distr": {"0": 0.0015210708370432258, "1": 0.0012419911799952388, "2": 0.001049801125191152, "3": 0.0009092779364436865, "4": 0.000801947433501482, "5": 0.000717282120604068, "6": 0.0006487876526080072, "7": 0.0005922341952100396, "8": 0.0005447496077977121, "9": 0.0005043143173679709, "10": 0.00046946704969741404, "11": 0.00043912435648962855, "12": 0.0004124657134525478, "13": 0.000388858636142686, "14": 0.12720729410648346, "15": 0.00034891863469965756, "16": 0.00033187505323439837, "17": 0.00031641896930523217, "18": 0.5333630442619324, "19": 0.3281910717487335}}, {"key": "guo2017semantically", "year": "2017", "title": "Semantically enhanced software traceability using deep learning techniques", "topic_distr": {"0": 0.001503705163486302, "1": 0.22988928854465485, "2": 0.0745607241988182, "3": 0.07904009521007538, "4": 0.0007922780932858586, "5": 0.0007086326368153095, "6": 0.0006409641355276108, "7": 0.5268973708152771, "8": 0.0005381806404329836, "9": 0.0004982329555787146, "10": 0.0004638059181161225, "11": 0.00043382911826483905, "12": 0.0004074919270351529, "13": 0.00038416951429098845, "14": 0.081671342253685, "15": 0.000344711123034358, "16": 0.00032787310192361474, "17": 0.00031260339892469347, "18": 0.00029869269928894937, "19": 0.00028596725314855576}}, {"key": "guo2020graphcodebert", "year": "2020", "title": "GraphCodeBERT: Pre-training Code Representations with Data Flow", "topic_distr": {"0": 0.0014687621733173728, "1": 0.0011982993455603719, "2": 0.0010127680143341422, "3": 0.03382498398423195, "4": 0.23116616904735565, "5": 0.0006919646402820945, "6": 0.0006258878274820745, "7": 0.0005713305436074734, "8": 0.0005255219875834882, "9": 0.00048651391989551485, "10": 0.03706220164895058, "11": 0.0004236249078530818, "12": 0.42584028840065, "13": 0.00037513335701078176, "14": 0.018956022337079048, "15": 0.0003366030869074166, "16": 0.00032016111072152853, "17": 0.24454286694526672, "18": 0.0002916670637205243, "19": 0.0002792409504763782}}, {"key": "guo2022learning", "year": "2022", "title": "Learning to Complete Code with Sketches", "topic_distr": {"0": 0.0022282262798398733, "1": 0.0018187075620517135, "2": 0.0015372452326118946, "3": 0.0013314449461176991, "4": 0.001174285775050521, "5": 0.0010503102093935013, "6": 0.30432048439979553, "7": 0.0008672034600749612, "8": 0.0007976721972227097, "9": 0.000738463131710887, "10": 0.27854931354522705, "11": 0.0006430060020647943, "12": 0.000603969965595752, "13": 0.0005694023566320539, "14": 0.0005385774420574307, "15": 0.000510918558575213, "16": 0.00048596179112792015, "17": 0.0004633296048268676, "18": 0.00044271163642406464, "19": 0.40132877230644226}}, {"key": "guo2022unixcoder", "year": "2022", "title": "UniXcoder: Unified Cross-Modal Pre-training for Code Representation", "topic_distr": {"0": 0.001757620950229466, "1": 0.0014344209339469671, "2": 0.0012124709319323301, "3": 0.0010501404758542776, "4": 0.2589055895805359, "5": 0.0008283984498120844, "6": 0.0007492933073081076, "7": 0.0006839790148660541, "8": 0.04041310027241707, "9": 0.03716401010751724, "10": 0.056059546768665314, "11": 0.0005071504856459796, "12": 0.3963793218135834, "13": 0.00044909791904501617, "14": 0.0004247857432346791, "15": 0.0004029706760775298, "16": 0.0003832868533208966, "17": 0.09325055778026581, "18": 0.00034917466109618545, "19": 0.10759513825178146}}, {"key": "guo2024deepseek", "year": "2024", "title": "DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence", "topic_distr": {"0": 0.7767341732978821, "1": 0.0028288369067013264, "2": 0.0023912140168249607, "3": 0.002071097958832979, "4": 0.20077280700206757, "5": 0.0016337847337126732, "6": 0.001477772369980812, "7": 0.0013489576522260904, "8": 0.0012408000184223056, "9": 0.001148698735050857, "10": 0.0010693256044760346, "11": 0.0010002127382904291, "12": 0.00093949114670977, "13": 0.0008857202483341098, "14": 0.0008377713384106755, "15": 0.000794747204054147, "16": 0.0007559263030998409, "17": 0.0007207213202491403, "18": 0.0006886495393700898, "19": 0.0006593104917556047}}, {"key": "gupta2017deepfix", "year": "2017", "title": "DeepFix: Fixing Common C Language Errors by Deep Learning", "topic_distr": {"0": 0.0017098661046475172, "1": 0.11757554113864899, "2": 0.0011792165460065007, "3": 0.0010213595815002918, "4": 0.0009008005145005882, "5": 0.0008056981605477631, "6": 0.0007287606713362038, "7": 0.3253690302371979, "8": 0.0006118984310887754, "9": 0.000566478818655014, "10": 0.0005273360875435174, "11": 0.026921860873699188, "12": 0.00046330844634212554, "13": 0.0004367914516478777, "14": 0.0004131454916205257, "15": 0.28141310811042786, "16": 0.0003727837756741792, "17": 0.0003554224967956543, "18": 0.23830245435237885, "19": 0.00032513783662579954}}, {"key": "gupta2018deep", "year": "2018", "title": "Deep Reinforcement Learning for Programming Language Correction", "topic_distr": {"0": 0.0023986436426639557, "1": 0.09866112470626831, "2": 0.06021299958229065, "3": 0.0014338415348902345, "4": 0.0012645918177440763, "5": 0.001131083001382649, "6": 0.061330199241638184, "7": 0.20049792528152466, "8": 0.0008590162615291774, "9": 0.0007952538435347378, "10": 0.0007403031340800226, "11": 0.08512556552886963, "12": 0.000650417641736567, "13": 0.000613191572483629, "14": 0.06903083622455597, "15": 0.10011818259954453, "16": 0.0005233341362327337, "17": 0.03825625777244568, "18": 0.2759007513523102, "19": 0.0004564461996778846}}, {"key": "gupta2018intelligent", "year": "2018", "title": "Intelligent code reviews using deep learning", "topic_distr": {"0": 0.0013575527118518949, "1": 0.0011070823529735208, "2": 0.5371245741844177, "3": 0.0008104500593617558, "4": 0.000714779831469059, "5": 0.23519353568553925, "6": 0.0005782685475423932, "7": 0.0005278621101751924, "8": 0.000485538796056062, "9": 0.00044949856237508357, "10": 0.0004184389836154878, "11": 0.0003913943364750594, "12": 0.0003676332999020815, "13": 0.00034659216180443764, "14": 0.000327829213347286, "15": 0.0003109934041276574, "16": 0.0002958023687824607, "17": 0.00028202624525874853, "18": 0.21865214407444, "19": 0.0002579955034889281}}, {"key": "gupta2019neural", "year": "2019", "title": "Neural Attribution for Semantic Bug-Localization in Student Programs", "topic_distr": {"0": 0.001781187835149467, "1": 0.0014549070037901402, "2": 0.0012297871289774776, "3": 0.07613669335842133, "4": 0.0009393988875672221, "5": 0.5282787680625916, "6": 0.0007599883247166872, "7": 0.0006937417783774436, "8": 0.0006381184794008732, "9": 0.0005907526356168091, "10": 0.0005499326507560909, "11": 0.0005143892485648394, "12": 0.1627795547246933, "13": 0.0004555080959107727, "14": 0.00043084894423373044, "15": 0.07005000859498978, "16": 0.00038875770405866206, "17": 0.00037065247306600213, "18": 0.15161794424057007, "19": 0.00033907010219991207}}, {"key": "gupta2023grace", "year": "2023", "title": "Grace: Language Models Meet Code Edits", "topic_distr": {"0": 0.41079747676849365, "1": 0.0018521632300689816, "2": 0.0015652008587494493, "3": 0.001355676562525332, "4": 0.0011956315720453858, "5": 0.0010694032534956932, "6": 0.0009672841406427324, "7": 0.0008829680737107992, "8": 0.0008121728315018117, "9": 0.0007518874481320381, "10": 0.0006999332690611482, "11": 0.20914992690086365, "12": 0.0006149493856355548, "13": 0.12414882332086563, "14": 0.0005483680870383978, "15": 0.24173922836780548, "16": 0.0004947959678247571, "17": 0.00047175231156870723, "18": 0.0004507595731411129, "19": 0.00043155549792572856}}, {"key": "gvero2015synthesizing", "year": "2015", "title": "Synthesizing Java expressions from free-form queries", "topic_distr": {"0": 0.0018606864614412189, "1": 0.19186756014823914, "2": 0.0012848132755607367, "3": 0.0011128243058919907, "4": 0.0009814713848754764, "5": 0.0008778517949394882, "6": 0.39760616421699524, "7": 0.15918032824993134, "8": 0.1187414675951004, "9": 0.025764120742678642, "10": 0.0005745611852034926, "11": 0.0005374260363169014, "12": 0.0005047995946370065, "13": 0.0004759078728966415, "14": 0.00045014434726908803, "15": 0.020078809931874275, "16": 0.00040616808109916747, "17": 0.00038725201738998294, "18": 0.07695342600345612, "19": 0.00035425525857135653}}, {"key": "habib2019neural", "year": "2019", "title": "Neural Bug Finding: A Study of Opportunities and Challenges", "topic_distr": {"0": 0.0014677336439490318, "1": 0.1928427517414093, "2": 0.0010127628920599818, "3": 0.0008771609282121062, "4": 0.0007736221887171268, "5": 0.7972795963287354, "6": 0.000625872693490237, "7": 0.0005713167483918369, "8": 0.0005255092983134091, "9": 0.0004865021619480103, "10": 0.00045288566616363823, "11": 0.00042361466330476105, "12": 0.00039789758739061654, "13": 0.0003751243057195097, "14": 0.00035481672966852784, "15": 0.00033659496693871915, "16": 0.00032015336910262704, "17": 0.0003052431857213378, "18": 0.0002916600205935538, "19": 0.00027923419838771224}}, {"key": "hajipour2019samplefix", "year": "2019", "title": "SampleFix: Learning to Correct Programs by Sampling Diverse Fixes", "topic_distr": {"0": 0.0018356508808210492, "1": 0.0014975060475990176, "2": 0.22573131322860718, "3": 0.0010964784305542707, "4": 0.0943194255232811, "5": 0.0008649549563415349, "6": 0.0007823590422049165, "7": 0.07491414248943329, "8": 0.0006569018587470055, "9": 0.0006081417668610811, "10": 0.0005661202012561262, "11": 0.000529530574567616, "12": 0.0004973835311830044, "13": 0.0004689162306021899, "14": 0.00044353120028972626, "15": 0.4085129499435425, "16": 0.00040020098094828427, "17": 0.0003815628297161311, "18": 0.18554387986660004, "19": 0.00034905082429759204}}, {"key": "haldar2020multiperspective", "year": "2020", "title": "A Multi-Perspective Architecture for Semantic Code Search", "topic_distr": {"0": 0.003467018250375986, "1": 0.33268019556999207, "2": 0.17557796835899353, "3": 0.002071127761155367, "4": 0.0018266425468027592, "5": 0.0016337953275069594, "6": 0.0014777819160372019, "7": 0.001348966732621193, "8": 0.0012408082839101553, "9": 0.0011487064184620976, "10": 0.001069332705810666, "11": 0.0010002193739637733, "12": 0.2017015814781189, "13": 0.000885726185515523, "14": 0.0008377769263461232, "15": 0.0007947525009512901, "16": 0.0007559313671663404, "17": 0.0007207261514849961, "18": 0.0006886541377753019, "19": 0.2690722942352295}}, {"key": "haque2020improved", "year": "2020", "title": "Improved Automatic Summarization of Subroutines via Attention to File Context", "topic_distr": {"0": 0.0024954781401902437, "1": 0.5093085169792175, "2": 0.0017216274281963706, "3": 0.0014911767793819308, "4": 0.0013151619350537658, "5": 0.0011763145448639989, "6": 0.001063986448571086, "7": 0.0009712409810163081, "8": 0.0008933681529015303, "9": 0.17159777879714966, "10": 0.0007699076668359339, "11": 0.0007201468106359243, "12": 0.000676427676808089, "13": 0.00063771300483495, "14": 0.0006031900411471725, "15": 0.0005722129717469215, "16": 0.0005442621768452227, "17": 0.0005189147777855396, "18": 0.0004958233330398798, "19": 0.3024267554283142}}, {"key": "haque2022semantic", "year": "2022", "title": "Semantic Similarity Metrics for Evaluating Source Code Summarization", "topic_distr": {"0": 0.21272213757038116, "1": 0.403923898935318, "2": 0.001037122681736946, "3": 0.0008983089937828481, "4": 0.000792267092037946, "5": 0.0007086244877427816, "6": 0.0006409568013623357, "7": 0.0005850860034115613, "8": 0.0005381745286285877, "9": 0.000498227309435606, "10": 0.00046380062121897936, "11": 0.0004338241706136614, "12": 0.00040748727042227983, "13": 0.0003841651196125895, "14": 0.00036336813354864717, "15": 0.00034470719401724637, "16": 0.0003278693475294858, "17": 0.0003125998191535473, "18": 0.00029868929414078593, "19": 0.3743187189102173}}, {"key": "harer2018learning", "year": "2018", "title": "Learning to Repair Software Vulnerabilities with Generative Adversarial Networks", "topic_distr": {"0": 0.0031172616872936487, "1": 0.30517661571502686, "2": 0.002152095315977931, "3": 0.0018639782210811973, "4": 0.0016439587343484163, "5": 0.11814908683300018, "6": 0.1770397126674652, "7": 0.001214053831063211, "8": 0.0011167125776410103, "9": 0.001033821958117187, "10": 0.0009623866062611341, "11": 0.0009001854341477156, "12": 0.0008455363567918539, "13": 0.0007971429149620235, "14": 0.12456750124692917, "15": 0.2568778097629547, "16": 0.0006803291034884751, "17": 0.0006486448692157865, "18": 0.0006197804468683898, "19": 0.00059337547281757}}, {"key": "hashimoto2018retrieve", "year": "2018", "title": "A Retrieve-and-Edit Framework for Predicting Structured Outputs", "topic_distr": {"0": 0.002711969194933772, "1": 0.0022140536457300186, "2": 0.001871439628303051, "3": 0.001620868337340653, "4": 0.22118140757083893, "5": 0.0012786244042217731, "6": 0.001156526617705822, "7": 0.001055714674293995, "8": 0.000971068860962987, "9": 0.0008989890338853002, "10": 0.0008368704002350569, "11": 0.18689703941345215, "12": 0.0007352599641308188, "13": 0.0006931780953891575, "14": 0.0006556524895131588, "15": 0.0006219811621122062, "16": 0.0005915993824601173, "17": 0.0005640474264509976, "18": 0.0005389475845731795, "19": 0.5729047656059265}}, {"key": "hata2018learning", "year": "2018", "title": "Learning to Generate Corrective Patches using Neural Machine Translation", "topic_distr": {"0": 0.0016209088498726487, "1": 0.0013229202013462782, "2": 0.0011180040892213583, "3": 0.0009683053940534592, "4": 0.0008540102862752974, "5": 0.0007638478418812156, "6": 0.0006909067160449922, "7": 0.0006306818104349077, "8": 0.0005801145453006029, "9": 0.0005370542639866471, "10": 0.0004999446682631969, "11": 0.0004676321696024388, "12": 0.0004392428381834179, "13": 0.000414103182265535, "14": 0.21007350087165833, "15": 0.4657532274723053, "16": 0.00035342026967555285, "17": 0.0003369608020875603, "18": 0.3122669458389282, "19": 0.000308249203953892}}, {"key": "hazoom2021text", "year": "2021", "title": "Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data", "topic_distr": {"0": 0.07108528167009354, "1": 0.09700698405504227, "2": 0.001687952782958746, "3": 0.0014619440771639347, "4": 0.0012893782695755363, "5": 0.0011532536009326577, "6": 0.0010431273840367794, "7": 0.04668790474534035, "8": 0.0008758540498092771, "9": 0.0008108417387120426, "10": 0.0007548139546997845, "11": 0.0007060286588966846, "12": 0.0006631665746681392, "13": 0.0006252108723856509, "14": 0.0005913647473789752, "15": 0.0005609949585050344, "16": 0.000533592130523175, "17": 0.0005087417084723711, "18": 0.00048610291560180485, "19": 0.7714674472808838}}, {"key": "he2019learning", "year": "2019", "title": "Learning to Fuzz from Symbolic Execution with Application to Smart Contracts", "topic_distr": {"0": 0.001920754206366837, "1": 0.24259884655475616, "2": 0.0013243964640423656, "3": 0.0011470750905573368, "4": 0.0010116760386154056, "5": 0.21478712558746338, "6": 0.00081846077227965, "7": 0.0007471173303201795, "8": 0.0006872143712826073, "9": 0.0006362043204717338, "10": 0.0005922436830587685, "11": 0.0005539656267501414, "12": 0.052824512124061584, "13": 0.0004905542591586709, "14": 0.0004639978287741542, "15": 0.3550300896167755, "16": 0.00041866814717650414, "17": 0.12320052832365036, "18": 0.00038140706601552665, "19": 0.00036515769897960126}}, {"key": "he2021learning", "year": "2021", "title": "Learning to Find Naming Issues with Big Code and Small Supervision", "topic_distr": {"0": 0.002114503411576152, "1": 0.0017258551670238376, "2": 0.2705574929714203, "3": 0.0012637156760320067, "4": 0.0011145499302074313, "5": 0.2641667425632477, "6": 0.0009016891126520932, "7": 0.0008230907842516899, "8": 0.0007570963934995234, "9": 0.0007008992251940072, "10": 0.0006524682394228876, "11": 0.000610297778621316, "12": 0.0005732474382966757, "13": 0.3730703890323639, "14": 0.0005111812497489154, "15": 0.07873328775167465, "16": 0.00046124201617203653, "17": 0.0004397610609885305, "18": 0.0004201919073238969, "19": 0.00040229014120996}}, {"key": "he2022distribution", "year": "2022", "title": "On Distribution Shift in Learning-based Bug Detectors", "topic_distr": {"0": 0.1797705441713333, "1": 0.0011981066782027483, "2": 0.0010127703426405787, "3": 0.0008771679131314158, "4": 0.0007736284169368446, "5": 0.8106213212013245, "6": 0.0006258767680265009, "7": 0.0005713204154744744, "8": 0.0005255126743577421, "9": 0.00048650530516169965, "10": 0.00045288860565051436, "11": 0.000423617399064824, "12": 0.00039790014852769673, "13": 0.00037512672133743763, "14": 0.0003548190288711339, "15": 0.0003365971497260034, "16": 0.00032015543547458947, "17": 0.00030524516478180885, "18": 0.00029166191234253347, "19": 0.00027923600282520056}}, {"key": "hellendoorn2015will", "year": "2015", "title": "Will they like this? Evaluating Code Contributions With Language Models", "topic_distr": {"0": 0.002973345573991537, "1": 0.0024243188090622425, "2": 0.0020495241042226553, "3": 0.0017752062994986773, "4": 0.0015656660543754697, "5": 0.0014003709657117724, "6": 0.0012666473630815744, "7": 0.0011562365107238293, "8": 0.0010635309154167771, "9": 0.0009845878230407834, "10": 0.9770984053611755, "11": 0.0008573155500926077, "12": 0.000805269053671509, "13": 0.0007591802859678864, "14": 0.0007180816028267145, "15": 0.0006812041974626482, "16": 0.0006479295552708209, "17": 0.0006177541799843311, "18": 0.000590264389757067, "19": 0.0005651169340126216}}, {"key": "hellendoorn2017deep", "year": "2017", "title": "Are Deep Neural Networks the Best Choice for Modeling Source Code?", "topic_distr": {"0": 0.0025460615288466215, "1": 0.002078473335132003, "2": 0.0017567503964528441, "3": 0.0015216090250760317, "4": 0.0013419974129647017, "5": 0.00120031728874892, "6": 0.0010856972075998783, "7": 0.000991059117950499, "8": 0.0009115972789004445, "9": 0.0008439318626187742, "10": 0.9803721308708191, "11": 0.0007348413928411901, "12": 0.000690230168402195, "13": 0.0006507255020551383, "14": 0.000615498109254986, "15": 0.0005838889046572149, "16": 0.0005553677910938859, "17": 0.0005295032169669867, "18": 0.0005059405812062323, "19": 0.00048438558587804437}}, {"key": "hellendoorn2018deep", "year": "2018", "title": "Deep Learning Type Inference", "topic_distr": {"0": 0.0014845207333564758, "1": 0.0012123878113925457, "2": 0.0010247989557683468, "3": 0.040793050080537796, "4": 0.0007828459492884576, "5": 0.0007001979392953217, "6": 0.0006333348574116826, "7": 0.0005781284417025745, "8": 0.0005317748291417956, "9": 0.8076915144920349, "10": 0.14144638180732727, "11": 0.0004286653420422226, "12": 0.00040264162817038596, "13": 0.0003795968077611178, "14": 0.0003590471460483968, "15": 0.0003406081232242286, "16": 0.00032397048198617995, "17": 0.000308882532408461, "18": 0.00029513740446418524, "19": 0.0002825634437613189}}, {"key": "hellendoorn2020global", "year": "2020", "title": "Global Relational Models of Source Code", "topic_distr": {"0": 0.001710257027298212, "1": 0.001394977793097496, "2": 0.001179229118861258, "3": 0.0010213881032541394, "4": 0.0009008175693452358, "5": 0.0008057147497311234, "6": 0.0007287754560820758, "7": 0.0006652496522292495, "8": 0.09282564371824265, "9": 0.026001809164881706, "10": 0.0005273467977531254, "11": 0.0004932631854899228, "12": 0.4337959587574005, "13": 0.00043680029921233654, "14": 0.00041315387352369726, "15": 0.0283119585365057, "16": 0.4077674448490143, "17": 0.00035542971454560757, "18": 0.00033961323788389564, "19": 0.0003251444431953132}}, {"key": "henkel2020semantic", "year": "2022", "title": "Semantic Robustness of Models of Source Code", "topic_distr": {"0": 0.5816547870635986, "1": 0.06044118106365204, "2": 0.001510236761532724, "3": 0.0013080708449706435, "4": 0.0011536619858816266, "5": 0.0010318646673113108, "6": 0.0009333299822174013, "7": 0.06749476492404938, "8": 0.0007836634758859873, "9": 0.0007254942320287228, "10": 0.0006753638153895736, "11": 0.0006317135412245989, "12": 0.2555045187473297, "13": 0.023336287587881088, "14": 0.0005291189299896359, "15": 0.000501945789437741, "16": 0.0004774273547809571, "17": 0.0004551926103886217, "18": 0.0004349367518443614, "19": 0.0004164068086538464}}, {"key": "heyman2020neural", "year": "2020", "title": "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent", "topic_distr": {"0": 0.002655508928000927, "1": 0.18415141105651855, "2": 0.05077750235795975, "3": 0.0015863910084590316, "4": 0.0013991384766995907, "5": 0.0012514260597527027, "6": 0.0011319254990667105, "7": 0.0010332579258829355, "8": 0.0009504125919193029, "9": 0.0008798660128377378, "10": 0.0008190687512978911, "11": 0.0007661305135115981, "12": 0.0732324868440628, "13": 0.0006784330471418798, "14": 0.0006417057011276484, "15": 0.0006087505607865751, "16": 0.0005790150607936084, "17": 0.0005520491977222264, "18": 0.0005274832365103066, "19": 0.6757780313491821}}, {"key": "hindle2012naturalness", "year": "2012", "title": "On the Naturalness of Software", "topic_distr": {"0": 0.2643074095249176, "1": 0.00166926474776119, "2": 0.0014112303033471107, "3": 0.0012222958030179143, "4": 0.00107801822014153, "5": 0.0009642074583098292, "6": 0.09693191200494766, "7": 0.0007961117080412805, "8": 0.0007322804885916412, "9": 0.0006779253017157316, "10": 0.2720332741737366, "11": 0.0005902935517951846, "12": 0.0005544576561078429, "13": 0.07789123058319092, "14": 0.027788110077381134, "15": 0.00046903439215384424, "16": 0.24966207146644592, "17": 0.0004253466904629022, "18": 0.0004064189561177045, "19": 0.0003891039814334363}}, {"key": "hoang2020cc2vec", "year": "2020", "title": "CC2Vec: Distributed Representations of Code Changes", "topic_distr": {"0": 0.0019185428973287344, "1": 0.4295347332954407, "2": 0.0013243515277281404, "3": 0.18712548911571503, "4": 0.13435262441635132, "5": 0.07536911219358444, "6": 0.0008184573380276561, "7": 0.0007471141288988292, "8": 0.0006872114608995616, "9": 0.0006362016429193318, "10": 0.0005922411801293492, "11": 0.16341441869735718, "12": 0.0005203329492360353, "13": 0.000490552163682878, "14": 0.0004639958788175136, "15": 0.0004401671467348933, "16": 0.0004186663718428463, "17": 0.00039916826062835753, "18": 0.00038140546530485153, "19": 0.00036515615647658706}}, {"key": "hong2021fix", "year": "2021", "title": "Fix-Filter-Fix: Intuitively Connect Any Models for Effective Bug Fixing", "topic_distr": {"0": 0.0019498947076499462, "1": 0.0015912603121250868, "2": 0.001345030847005546, "3": 0.001164993504062295, "4": 0.0010274755768477917, "5": 0.0009189997217617929, "6": 0.0008312428835779428, "7": 0.0007587852305732667, "8": 0.0006979467580094934, "9": 0.0006461400771513581, "10": 0.0006014928803779185, "11": 0.00056261703139171, "12": 0.0005284613580442965, "13": 0.0004982153768651187, "14": 0.04954962432384491, "15": 0.7823463678359985, "16": 0.15381786227226257, "17": 0.00040540387271903455, "18": 0.00038736360147595406, "19": 0.00037086044903844595}}, {"key": "hsiao2014using", "year": "2014", "title": "Using Web Corpus Statistics for Program Analysis", "topic_distr": {"0": 0.0019795906264334917, "1": 0.0016164151020348072, "2": 0.0013663799036294222, "3": 0.0011834808392450213, "4": 0.0010437746532261372, "5": 0.40164297819137573, "6": 0.38931936025619507, "7": 0.0007708234479650855, "8": 0.000709019775968045, "9": 0.0006563911447301507, "10": 0.0006110356189310551, "11": 0.0005715430597774684, "12": 0.19547615945339203, "13": 0.0005061195697635412, "14": 0.00047872052527964115, "15": 0.00045413561747409403, "16": 0.0004319525323808193, "17": 0.00041183564462698996, "18": 0.00039350916631519794, "19": 0.0003767441667150706}}, {"key": "hu2017codesum", "year": "2017", "title": "CodeSum: Translate Program Language to Natural Language", "topic_distr": {"0": 0.0020800770726054907, "1": 0.22889085114002228, "2": 0.0014347621472552419, "3": 0.0012426638277247548, "4": 0.001095977146178484, "5": 0.0009802703280001879, "6": 0.0008866627467796206, "7": 0.0008093742071650922, "8": 0.10148951411247253, "9": 0.0006892189267091453, "10": 0.0006415950483642519, "11": 0.0006001273286528885, "12": 0.24108178913593292, "13": 0.0005314318696036935, "14": 0.0005026625003665686, "15": 0.0004768480721395463, "16": 0.0004535555490292609, "17": 0.0004324325709603727, "18": 0.00041318952571600676, "19": 0.4152670204639435}}, {"key": "huang2021cosqa", "year": "2021", "title": "CoSQA: 20,000+ Web Queries for Code Search and Question Answering", "topic_distr": {"0": 0.0031997549813240767, "1": 0.002611783565953374, "2": 0.8603948950767517, "3": 0.0019117603078484535, "4": 0.0016860991017892957, "5": 0.0015080926241353154, "6": 0.0013640820980072021, "7": 0.0012451780494302511, "8": 0.0011453412007540464, "9": 0.0010603256523609161, "10": 0.0009870589710772038, "11": 0.0009232631418853998, "12": 0.0008672130643390119, "13": 0.0008175789262168109, "14": 0.0007733188685961068, "15": 0.0007336047128774226, "16": 0.0006977704470045865, "17": 0.0006652739248238504, "18": 0.11679907143115997, "19": 0.0006085875793360174}}, {"key": "husain2019codesearchnet", "year": "2019", "title": "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search", "topic_distr": {"0": 0.1695977747440338, "1": 0.10552185028791428, "2": 0.001212438684888184, "3": 0.0010501346550881863, "4": 0.0009261785889975727, "5": 0.0008283979259431362, "6": 0.07028618454933167, "7": 0.24021098017692566, "8": 0.0006291380850598216, "9": 0.0005824388354085386, "10": 0.0005421933019533753, "11": 0.0005071501364000142, "12": 0.0004763617180287838, "13": 0.0004490976280067116, "14": 0.000424785481300205, "15": 0.0004029704141430557, "16": 0.0003832865913864225, "17": 0.00036543619353324175, "18": 0.00034917445736937225, "19": 0.4052540063858032}}, {"key": "hussain2019deep", "year": "2019", "title": "Deep Transfer Learning for Source Code Modeling", "topic_distr": {"0": 0.17448358237743378, "1": 0.29010897874832153, "2": 0.001229807618074119, "3": 0.0010651536285877228, "4": 0.14794860780239105, "5": 0.09627848118543625, "6": 0.0007600057288073003, "7": 0.08898893743753433, "8": 0.0006381330895237625, "9": 0.000590766198001802, "10": 0.08902128040790558, "11": 0.0005144010647200048, "12": 0.00048317245091311634, "13": 0.00045551854418590665, "14": 0.0004308587813284248, "15": 0.00040873183752410114, "16": 0.00038876658072695136, "17": 0.10551159828901291, "18": 0.00035416672471910715, "19": 0.000339077872922644}}, {"key": "iyer2016summarizing", "year": "2016", "title": "Summarizing Source Code using a Neural Attention Model", "topic_distr": {"0": 0.0020464430563151836, "1": 0.2782384753227234, "2": 0.0014112464850768447, "3": 0.0012223152443766594, "4": 0.001078037777915597, "5": 0.0009642247459851205, "6": 0.13811476528644562, "7": 0.13808824121952057, "8": 0.0007322935271076858, "9": 0.0006779373507015407, "10": 0.0006310930475592613, "11": 0.00059030408738181, "12": 0.0005544674932025373, "13": 0.0005227330839261413, "14": 0.0004944346728734672, "15": 0.0004690427449531853, "16": 0.00044613148202188313, "17": 0.00042535425745882094, "18": 0.00040642620297148824, "19": 0.4328860342502594}}, {"key": "iyer2018mapping", "year": "2018", "title": "Mapping Language to Code in Programmatic Context", "topic_distr": {"0": 0.3950027823448181, "1": 0.0019214216154068708, "2": 0.0016242270357906818, "3": 0.001406766939908266, "4": 0.0012407200410962105, "5": 0.0011097309179604053, "6": 0.0010037608444690704, "7": 0.13106600940227509, "8": 0.0008428002474829555, "9": 0.0007802414475008845, "10": 0.0007263280567713082, "11": 0.0006793838692829013, "12": 0.0006381393759511411, "13": 0.0006016161059960723, "14": 0.0005690472899004817, "15": 0.0005398236098699272, "16": 0.3304159343242645, "17": 0.0004895423189736903, "18": 0.0004677578981500119, "19": 0.12887395918369293}}, {"key": "iyer2019learning", "year": "2019", "title": "Learning Programmatic Idioms for Scalable Semantic Parsing", "topic_distr": {"0": 0.0022282118443399668, "1": 0.0018184183863922954, "2": 0.0015371968038380146, "3": 0.0013314064126461744, "4": 0.14396384358406067, "5": 0.0010502804070711136, "6": 0.0009499875595793128, "7": 0.0008671791874803603, "8": 0.8401460647583008, "9": 0.0007384424679912627, "10": 0.0006874173413962126, "11": 0.000642988015897572, "12": 0.0006039530853740871, "13": 0.0005693864077329636, "14": 0.0005385623662732542, "15": 0.0005109042394906282, "16": 0.00048594819963909686, "17": 0.0004633166245184839, "18": 0.00044269923819229007, "19": 0.00042383858817629516}}, {"key": "jain2020contrastive", "year": "2020", "title": "Contrastive Code Representation Learning", "topic_distr": {"0": 0.001919399481266737, "1": 0.0015669962158426642, "2": 0.10536620765924454, "3": 0.0011470942990854383, "4": 0.1554792821407318, "5": 0.11639124900102615, "6": 0.0008184670004993677, "7": 0.0007471229764632881, "8": 0.0006872196099720895, "9": 0.14981171488761902, "10": 0.0005922482232563198, "11": 0.0005539698759093881, "12": 0.40968725085258484, "13": 0.0004905579844489694, "14": 0.0004640013794414699, "15": 0.0004401723563205451, "16": 0.0004186713485978544, "17": 0.000399173004552722, "18": 0.05265401303768158, "19": 0.00036516046384349465}}, {"key": "jayasundara2019treecaps", "year": "2019", "title": "TreeCaps: Tree-Structured Capsule Networks for Program Source Code Processing", "topic_distr": {"0": 0.001418435131199658, "1": 0.3219619691371918, "2": 0.0009782471461221576, "3": 0.0008472749614156783, "4": 0.0007472612196579576, "5": 0.0006683692918159068, "6": 0.0006045455229468644, "7": 0.0005518485559150577, "8": 0.15169084072113037, "9": 0.00046992412535473704, "10": 0.000437453156337142, "11": 0.000409179599955678, "12": 0.22834980487823486, "13": 0.00036234158324077725, "14": 0.00034272600896656513, "15": 0.04878443479537964, "16": 0.00030924382735975087, "17": 0.0002948417386505753, "18": 0.2405015230178833, "19": 0.0002697190211620182}}, {"key": "jesse2021learning", "year": "2021", "title": "Learning Type Annotation: Is Big Data Enough?", "topic_distr": {"0": 0.0023112830240279436, "1": 0.15554818511009216, "2": 0.12208884954452515, "3": 0.0013807544019073248, "4": 0.001217768993228674, "5": 0.0010892024729400873, "6": 0.0009851926006376743, "7": 0.0008993155206553638, "8": 0.34196045994758606, "9": 0.36695095896720886, "10": 0.0007128919824026525, "11": 0.0006668161950074136, "12": 0.0006263346876949072, "13": 0.0005904869758524001, "14": 0.0005585206672549248, "15": 0.0005298375617712736, "16": 0.0005039566894993186, "17": 0.0004804864292964339, "18": 0.00045910500921308994, "19": 0.0004395454016048461}}, {"key": "jesse2022learning", "year": "2022", "title": "Learning To Predict User-Defined Types", "topic_distr": {"0": 0.0020447857677936554, "1": 0.0016693210927769542, "2": 0.001411148114129901, "3": 0.0012222710065543652, "4": 0.15291312336921692, "5": 0.0009641870274208486, "6": 0.0008721152553334832, "7": 0.0007960948860272765, "8": 0.0007322650053538382, "9": 0.7097361087799072, "10": 0.0006310684257186949, "11": 0.0005902810953557491, "12": 0.0005544458981603384, "13": 0.0005227127112448215, "14": 0.0004944154061377048, "15": 0.0004690244677476585, "16": 0.12315578758716583, "17": 0.00042533769737929106, "18": 0.0004064103704877198, "19": 0.000389095745049417}}, {"key": "jesse2023large", "year": "2023", "title": "Large Language Models and Simple, Stupid Bugs", "topic_distr": {"0": 0.2869919240474701, "1": 0.001818594173528254, "2": 0.001537220785394311, "3": 0.0013314293464645743, "4": 0.0011742737842723727, "5": 0.21955743432044983, "6": 0.0009500053129158914, "7": 0.0008671952527947724, "8": 0.0007976646302267909, "9": 0.0007384561467915773, "10": 0.47955411672592163, "11": 0.0006429999484680593, "12": 0.0006039642612449825, "13": 0.0005693969433195889, "14": 0.0005385723197832704, "15": 0.0005109137273393571, "16": 0.000485957192722708, "17": 0.0004633252101484686, "18": 0.00044270744547247887, "19": 0.0004238464462105185}}, {"key": "jian2021multimodal", "year": "2021", "title": "Multimodal Representation for Neural Code Search", "topic_distr": {"0": 0.00239854259416461, "1": 0.415827214717865, "2": 0.0016554462490603328, "3": 0.34775686264038086, "4": 0.0012645829701796174, "5": 0.0011310765985399485, "6": 0.0010230683255940676, "7": 0.0009338896488770843, "8": 0.0008590115467086434, "9": 0.0007952494197525084, "10": 0.0007402990595437586, "11": 0.0006924518966116011, "12": 0.0006504140328615904, "13": 0.000613188196439296, "14": 0.0005799929494969547, "15": 0.0005502071580849588, "16": 0.000523331284057349, "17": 0.0004989586886949837, "18": 0.00047675526002421975, "19": 0.22102950513362885}}, {"key": "jian2022assemble", "year": "2022", "title": "Assemble Foundation Models for Automatic Code Summarization", "topic_distr": {"0": 0.18018437922000885, "1": 0.002368329092860222, "2": 0.0020019863732159138, "3": 0.001733963843435049, "4": 0.19944751262664795, "5": 0.0013678256655111909, "6": 0.0012372098863124847, "7": 0.001129364944063127, "8": 0.0010388139635324478, "9": 0.0009617055766284466, "10": 0.0008952533826231956, "11": 0.0008373911259695888, "12": 0.1298379749059677, "13": 0.000741536554414779, "14": 0.000701393058989197, "15": 0.0006653727032244205, "16": 0.0006328713498078287, "17": 0.0006033973186276853, "18": 0.0005765464156866074, "19": 0.47303712368011475}}, {"key": "jiang2017automatically", "year": "2017", "title": "Automatically Generating Commit Messages from Diffs using Neural Machine Translation", "topic_distr": {"0": 0.0017323312349617481, "1": 0.27604323625564575, "2": 0.0011956199305132031, "3": 0.0010355733102187514, "4": 0.0009133348939940333, "5": 0.20194660127162933, "6": 0.0007389019592665136, "7": 0.0006744934362359345, "8": 0.0006204133969731629, "9": 0.000574361823964864, "10": 0.0005346743855625391, "11": 0.5108491778373718, "12": 0.0004697557305917144, "13": 0.00044286969932727516, "14": 0.0004188947204966098, "15": 0.00039738218765705824, "16": 0.0003779713297262788, "17": 0.00036036845995113254, "18": 0.0003443322202656418, "19": 0.00032966237631626427}}, {"key": "jiang2021treebert", "year": "2021", "title": "TreeBERT: A Tree-Based Pre-Trained Model for Programming Language", "topic_distr": {"0": 0.002045366680249572, "1": 0.0016698518302291632, "2": 0.0014112094650045037, "3": 0.0012223008088767529, "4": 0.44011595845222473, "5": 0.0009642151999287307, "6": 0.0008721405756659806, "7": 0.0007961179362609982, "8": 0.0007322861929424107, "9": 0.0006779305986128747, "10": 0.0006310867029242218, "11": 0.0005902982084080577, "12": 0.5451180934906006, "13": 0.000522727845236659, "14": 0.0004944297252222896, "15": 0.0004690380592364818, "16": 0.00044612702913582325, "17": 0.00042535000829957426, "18": 0.0004064221284352243, "19": 0.0003891070082318038}}, {"key": "johnson2020learning", "year": "2020", "title": "Learning Graph Structure With A Finite-State Automaton Layer", "topic_distr": {"0": 0.001600028364919126, "1": 0.12983471155166626, "2": 0.0011036205105483532, "3": 0.0009558909805491567, "4": 0.0008430593297816813, "5": 0.0007540523656643927, "6": 0.0006820465205237269, "7": 0.0006225940305739641, "8": 0.209602490067482, "9": 0.13835196197032928, "10": 0.000493533443659544, "11": 0.0004616352671291679, "12": 0.5122284889221191, "13": 0.0004087927518412471, "14": 0.00038666254840791225, "15": 0.0003668052959255874, "16": 0.00034888804657384753, "17": 0.00033263961086049676, "18": 0.00031783731537871063, "19": 0.00030429623438976705}}, {"key": "jung2021commitbert", "year": "2021", "title": "CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model", "topic_distr": {"0": 0.180076465010643, "1": 0.13421636819839478, "2": 0.0015652022557333112, "3": 0.0013556759804487228, "4": 0.001195613294839859, "5": 0.0010693870717659593, "6": 0.0009672694723121822, "7": 0.0008829546859487891, "8": 0.0008121604914776981, "9": 0.0007518760394304991, "10": 0.0006999226752668619, "11": 0.4276409447193146, "12": 0.0006149400142021477, "13": 0.000579744519200176, "14": 0.24520243704319, "15": 0.0005201984895393252, "16": 0.0004947884590364993, "17": 0.00047174515202641487, "18": 0.00045075270463712513, "19": 0.0004315489495638758}}, {"key": "kacmajor2019automatic", "year": "2019", "title": "Automatic Acquisition of Annotated Training Corpora for Test-Code Generation", "topic_distr": {"0": 0.17594310641288757, "1": 0.0011842329986393452, "2": 0.3348836302757263, "3": 0.0008669658564031124, "4": 0.0007646310841664672, "5": 0.0006839057896286249, "6": 0.000618598482105881, "7": 0.10442453622817993, "8": 0.000519401510246098, "9": 0.0004808477533515543, "10": 0.0004476220055948943, "11": 0.00041869119741022587, "12": 0.000393273017834872, "13": 0.06427266448736191, "14": 0.19474846124649048, "15": 0.00033268285915255547, "16": 0.0003164323861710727, "17": 0.00030169548699632287, "18": 0.0002882701810449362, "19": 0.11811035871505737}}, {"key": "kanade2020pretrained", "year": "2020", "title": "Pre-trained Contextual Embedding of Source Code", "topic_distr": {"0": 0.0017578023253008723, "1": 0.0014343267539516091, "2": 0.001212496543303132, "3": 0.001050140941515565, "4": 0.24942970275878906, "5": 0.0008284022915177047, "6": 0.12805551290512085, "7": 0.0006839820998720825, "8": 0.0006291412864811718, "9": 0.0005824418622069061, "10": 0.0005421960959210992, "11": 0.0005071527557447553, "12": 0.0004763641918543726, "13": 0.15302112698554993, "14": 0.00042478766408748925, "15": 0.04651869833469391, "16": 0.274274080991745, "17": 0.000365438056178391, "18": 0.0003491762327030301, "19": 0.13785703480243683}}, {"key": "karaivanov2014phrase", "year": "2014", "title": "Phrase-Based Statistical Translation of Programming Languages", "topic_distr": {"0": 0.00201194966211915, "1": 0.0016425007488578558, "2": 0.0013884843792766333, "3": 0.001202586106956005, "4": 0.0010606361320242286, "5": 0.0009486593189649284, "6": 0.04187586531043053, "7": 0.0802856907248497, "8": 0.047319941222667694, "9": 0.000666993495542556, "10": 0.0006209053681232035, "11": 0.0005807748530060053, "12": 0.0005455167847685516, "13": 0.000514294661115855, "14": 0.6311009526252747, "15": 0.00046147103421390057, "16": 0.00043892962276004255, "17": 0.0004184878198429942, "18": 0.18653255701065063, "19": 0.0003828295157290995}}, {"key": "karampatsis2019deep", "year": "2019", "title": "Maybe Deep Neural Networks are the Best Choice for Modeling Source Code", "topic_distr": {"0": 0.16376864910125732, "1": 0.10748819261789322, "2": 0.0010627965675666928, "3": 0.0009204974048770964, "4": 0.0008118405239656568, "5": 0.0007261309074237943, "6": 0.0006567914388142526, "7": 0.0005995403626002371, "8": 0.0005514699732884765, "9": 0.0005105358432047069, "10": 0.19935685396194458, "11": 0.0004445416561793536, "12": 0.0004175541107542813, "13": 0.3440098762512207, "14": 0.17706617712974548, "15": 0.0003532230912242085, "16": 0.00033596926368772984, "17": 0.0003203224914614111, "18": 0.00030606830841861665, "19": 0.0002930286282207817}}, {"key": "karampatsis2020big", "year": "2020", "title": "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code", "topic_distr": {"0": 0.4673548638820648, "1": 0.0017259652959182858, "2": 0.001458999002352357, "3": 0.00126371206715703, "4": 0.0011145469034090638, "5": 0.000996879767626524, "6": 0.0009016860858537257, "7": 0.000823088048491627, "8": 0.0007570938905701041, "9": 0.0007008968386799097, "10": 0.0006524660275317729, "11": 0.000610295741353184, "12": 0.0005732454592362046, "13": 0.5183466672897339, "14": 0.000511179503519088, "15": 0.00048492764472030103, "16": 0.0004612404736690223, "17": 0.0004397596057970077, "18": 0.00042019051034003496, "19": 0.000402288802433759}}, {"key": "karampatsis2020scelmo", "year": "2020", "title": "SCELMo: Source Code Embeddings from Language Models", "topic_distr": {"0": 0.0737415999174118, "1": 0.001697608851827681, "2": 0.0014347307151183486, "3": 0.13245469331741333, "4": 0.10552731901407242, "5": 0.1872764229774475, "6": 0.3546968996524811, "7": 0.0008093709475360811, "8": 0.0007444766233675182, "9": 0.0006892161327414215, "10": 0.0006415924872271717, "11": 0.0006001249421387911, "12": 0.0005636921268887818, "13": 0.0005314297741279006, "14": 0.0005026605213060975, "15": 0.030462918803095818, "16": 0.00045355374459177256, "17": 0.0004324308247305453, "18": 0.10634369403123856, "19": 0.0003955845022574067}}, {"key": "karmakar2021what", "year": "2021", "title": "What do pre-trained code models know about code?", "topic_distr": {"0": 0.09349551051855087, "1": 0.0020782954525202513, "2": 0.001756854704581201, "3": 0.10617172718048096, "4": 0.33766499161720276, "5": 0.0012003511656075716, "6": 0.0010857274755835533, "7": 0.0009910869412124157, "8": 0.23034991323947906, "9": 0.000843955553136766, "10": 0.0007856396259739995, "11": 0.0007348619983531535, "12": 0.0006902494933456182, "13": 0.0006507437210530043, "14": 0.0006155153387226164, "15": 0.0005839052610099316, "16": 0.0005553833907470107, "17": 0.15219485759735107, "18": 0.0005059547838754952, "19": 0.0670444443821907}}, {"key": "karmakar2022jemma", "year": "2022", "title": "JEMMA: An Extensible Java Dataset for ML4Code Applications", "topic_distr": {"0": 0.6846237182617188, "1": 0.0013949786080047488, "2": 0.0011792535660788417, "3": 0.0010213827481493354, "4": 0.0009008125634863973, "5": 0.0008057092200033367, "6": 0.00072877062484622, "7": 0.00066524522844702, "8": 0.0006119066965766251, "9": 0.027985524386167526, "10": 0.0005273432470858097, "11": 0.0004932599258609116, "12": 0.20284263789653778, "13": 0.00043679738882929087, "14": 0.00041315110865980387, "15": 0.0003919335431419313, "16": 0.00037278883974067867, "17": 0.0003554273280315101, "18": 0.07392428815364838, "19": 0.00032514226040802896}}, {"key": "karpathy2015visualizing", "year": "2015", "title": "Visualizing and Understanding Recurrent Networks", "topic_distr": {"0": 0.3276548385620117, "1": 0.23024895787239075, "2": 0.00195653410628438, "3": 0.0016946062678471208, "4": 0.14431962370872498, "5": 0.001336761750280857, "6": 0.0012091118842363358, "7": 0.04183756932616234, "8": 0.07977776229381561, "9": 0.1631307452917099, "10": 0.0008749213884584606, "11": 0.000818373286165297, "12": 0.0007686909520998597, "13": 0.0007246956811286509, "14": 0.0006854638340882957, "15": 0.0006502615287899971, "16": 0.0006184983649291098, "17": 0.0005896936636418104, "18": 0.000563452544156462, "19": 0.0005394473555497825}}, {"key": "katz2019towards", "year": "2019", "title": "Towards Neural Decompilation", "topic_distr": {"0": 0.00235368381254375, "1": 0.0019214974017813802, "2": 0.00162420270498842, "3": 0.0014068130403757095, "4": 0.0012407362228259444, "5": 0.4379337728023529, "6": 0.040458474308252335, "7": 0.0009162776404991746, "8": 0.16696828603744507, "9": 0.0007802520412951708, "10": 0.0007263379520736635, "11": 0.0006793931243009865, "12": 0.0006381480488926172, "13": 0.0006016242550686002, "14": 0.053220152854919434, "15": 0.0005398309440352023, "16": 0.000513461884111166, "17": 0.0004895489546470344, "18": 0.28653964400291443, "19": 0.00044783574412576854}}, {"key": "key2022speak", "year": "2022", "title": "I Speak, You Verify: Toward Trustworthy Neural Program Synthesis", "topic_distr": {"0": 0.003117460524663329, "1": 0.0025466689839959145, "2": 0.0021523400209844112, "3": 0.0018639646004885435, "4": 0.0016439438331872225, "5": 0.0014703868655487895, "6": 0.0013299769489094615, "7": 0.001214045798406005, "8": 0.0011167051270604134, "9": 0.001033815206028521, "10": 0.0009623802616260946, "11": 0.0009001794969663024, "12": 0.0008455308270640671, "13": 0.0007971376762725413, "14": 0.000753984204493463, "15": 0.0007152629550546408, "16": 0.0006803246797062457, "17": 0.0006486405618488789, "18": 0.9756138920783997, "19": 0.0005933715729042888}}, {"key": "kharkar2022learning", "year": "2022", "title": "Learning to Reduce False Positives in Analytic Bug Detectors", "topic_distr": {"0": 0.002653303323313594, "1": 0.002166826045140624, "2": 0.0018315119668841362, "3": 0.0015863657463341951, "4": 0.0013991123996675014, "5": 0.7307546138763428, "6": 0.0011319038458168507, "7": 0.0010332382516935468, "8": 0.0009503945475444198, "9": 0.0008798493072390556, "10": 0.09886261820793152, "11": 0.0007661159615963697, "12": 0.0007196061196736991, "13": 0.000678420125041157, "14": 0.0006416934775188565, "15": 0.0006087390356697142, "16": 0.15175114572048187, "17": 0.000552038720343262, "18": 0.0005274732247926295, "19": 0.0005050008767284453}}, {"key": "kim2020code", "year": "2020", "title": "Code Prediction by Feeding Trees to Transformers", "topic_distr": {"0": 0.001890194951556623, "1": 0.0015430597122758627, "2": 0.0013043048093095422, "3": 0.0011296833399683237, "4": 0.000996338319964707, "5": 0.0008911499171517789, "6": 0.0008060525869950652, "7": 0.0007357907015830278, "8": 0.11650151759386063, "9": 0.0006265592528507113, "10": 0.31590452790260315, "11": 0.0005455673090182245, "12": 0.1592736542224884, "13": 0.00048311727005057037, "14": 0.07618657499551773, "15": 0.00043349587940610945, "16": 0.00041232098010368645, "17": 0.0003931183891836554, "18": 0.00037562480429187417, "19": 0.31956741213798523}}, {"key": "koc2017learning", "year": "2017", "title": "Learning a Classifier for False Positive Error Reports Emitted by Static Code Analysis Tools", "topic_distr": {"0": 0.0016002601478248835, "1": 0.0013056871248409152, "2": 0.053079478442668915, "3": 0.0009558986639603972, "4": 0.0008430650341324508, "5": 0.47990795969963074, "6": 0.0006820532726123929, "7": 0.25360310077667236, "8": 0.12183774262666702, "9": 0.000530172314029187, "10": 0.08229335397481918, "11": 0.0004616398364305496, "12": 0.0004336143028922379, "13": 0.00040879679727368057, "14": 0.0003866663610097021, "15": 0.0003668089339043945, "16": 0.0003488914808258414, "17": 0.00033264289959333837, "18": 0.00031784045859239995, "19": 0.00030429926118813455}}, {"key": "kocetkov2022stack", "year": "2022", "title": "The Stack: 3TB of permissively licensed source code", "topic_distr": {"0": 0.9809674620628357, "1": 0.0022138028871268034, "2": 0.0018713631434366107, "3": 0.0016208401648327708, "4": 0.0014295245055109262, "5": 0.0012786018196493387, "6": 0.0011565060121938586, "7": 0.001055695815011859, "8": 0.0009710515150800347, "9": 0.000898972968570888, "10": 0.000836855499073863, "11": 0.000782767659984529, "12": 0.0007352468674071133, "13": 0.000693165697157383, "14": 0.0006556407897733152, "15": 0.0006219701026566327, "16": 0.0005915887886658311, "17": 0.0005640373565256596, "18": 0.0005389379803091288, "19": 0.0005159771535545588}}, {"key": "korbak2021energy", "year": "2021", "title": "Energy-Based Models for Code Generation under Compilability Constraints", "topic_distr": {"0": 0.27246639132499695, "1": 0.25107258558273315, "2": 0.0025319510605186224, "3": 0.002192934276536107, "4": 0.10841076076030731, "5": 0.0017298907041549683, "6": 0.001564700505696237, "7": 0.0014283088967204094, "8": 0.001313788932748139, "9": 0.0012162699131295085, "10": 0.056043289601802826, "11": 0.0010590492747724056, "12": 0.0009947558864951134, "13": 0.0009378219256177545, "14": 0.2932051718235016, "15": 0.0008414974436163902, "16": 0.0008003929979167879, "17": 0.0007631171029061079, "18": 0.000729158753529191, "19": 0.0006980938487686217}}, {"key": "kovalchuk2022human", "year": "2022", "title": "Human perceiving behavior modeling in evaluation of code generation models", "topic_distr": {"0": 0.0021929831709712744, "1": 0.0017865090630948544, "2": 0.001510230591520667, "3": 0.0013080491917207837, "4": 0.001153648248873651, "5": 0.0010318518616259098, "6": 0.0009333186317235231, "7": 0.0008519632974639535, "8": 0.0007836539880372584, "9": 0.0007254854426719248, "10": 0.0006753556663170457, "11": 0.983079195022583, "12": 0.000593355915043503, "13": 0.0005593957030214369, "14": 0.0005291125271469355, "15": 0.000501939735841006, "16": 0.0004774215631186962, "17": 0.0004551870806608349, "18": 0.00043493148405104876, "19": 0.0004164017445873469}}, {"key": "kovalchuk2023test", "year": "2023", "title": "Test-based and metric-based evaluation of code generation models for practical question answering", "topic_distr": {"0": 0.0018088259967043996, "1": 0.0014759199693799019, "2": 0.3135528862476349, "3": 0.001080565620213747, "4": 0.10838429629802704, "5": 0.0008524066652171314, "6": 0.000771008781157434, "7": 0.0007038016337901354, "8": 0.0006473716930486262, "9": 0.0005993190570734441, "10": 0.0005579071003012359, "11": 0.5662879347801208, "12": 0.0004901676438748837, "13": 0.0004621133557520807, "14": 0.00043709660531021655, "15": 0.00041464928654022515, "16": 0.00039439499960280955, "17": 0.0003760272520594299, "18": 0.00035929420846514404, "19": 0.00034398690331727266}}, {"key": "kovalenko2019pathminer", "year": "2019", "title": "PathMiner : A Library for Mining of Path-Based Representations of Code", "topic_distr": {"0": 0.19686096906661987, "1": 0.047404974699020386, "2": 0.0017567932372912765, "3": 0.11841221898794174, "4": 0.0013420181348919868, "5": 0.0012003355659544468, "6": 0.33841821551322937, "7": 0.0009910741355270147, "8": 0.0009116110159084201, "9": 0.0008439446100965142, "10": 0.0007856294396333396, "11": 0.0007348525105044246, "12": 0.28641200065612793, "13": 0.0006507353391498327, "14": 0.0006155074224807322, "15": 0.0005838977522216737, "16": 0.0005553761729970574, "17": 0.0005295111914165318, "18": 0.000505948206409812, "19": 0.00048439292004331946}}, {"key": "kremenek2007factor", "year": "2007", "title": "A Factor Graph Model for Software Bug Finding", "topic_distr": {"0": 0.003197770332917571, "1": 0.0026108180172741413, "2": 0.0022073648869991302, "3": 0.0019117699703201652, "4": 0.0016861100448295474, "5": 0.7665761113166809, "6": 0.0013640914112329483, "7": 0.09526963531970978, "8": 0.0011453491169959307, "9": 0.0010603329865261912, "10": 0.0009870657231658697, "11": 0.0009232694865204394, "12": 0.11612845212221146, "13": 0.0008175845723599195, "14": 0.0007733241654932499, "15": 0.0007336097187362611, "16": 0.0006977752200327814, "17": 0.0006652784650214016, "18": 0.0006356738740578294, "19": 0.0006085917702876031}}, {"key": "kulal2019spoc", "year": "2019", "title": "SPoC: Search-based Pseudocode to Code", "topic_distr": {"0": 0.0022281804122030735, "1": 0.17267920076847076, "2": 0.19175289571285248, "3": 0.001331433653831482, "4": 0.001174275646917522, "5": 0.0010503025259822607, "6": 0.0009500072337687016, "7": 0.1270493119955063, "8": 0.0007976663764566183, "9": 0.0007384577766060829, "10": 0.0006874315440654755, "11": 0.0006430013454519212, "12": 0.0006039656000211835, "13": 0.0005693981656804681, "14": 0.135319784283638, "15": 0.0005109148332849145, "16": 0.0004859582695644349, "17": 0.0004633262287825346, "18": 0.3605406582355499, "19": 0.0004238473775330931}}, {"key": "kurbatova2020recommendation", "year": "2020", "title": "Recommendation of Move Method Refactoring Using Path-Based Representation of Code", "topic_distr": {"0": 0.23785355687141418, "1": 0.001786454813554883, "2": 0.28876084089279175, "3": 0.26966187357902527, "4": 0.0011536551173776388, "5": 0.0010318598942831159, "6": 0.0009333257912658155, "7": 0.0008519697585143149, "8": 0.08156406879425049, "9": 0.0007254909723997116, "10": 0.0006753607303835452, "11": 0.0006317106890492141, "12": 0.0005933603970333934, "13": 0.0005593999521806836, "14": 0.0005291165434755385, "15": 0.0005019435193389654, "16": 0.11087950319051743, "17": 0.00045519054401665926, "18": 0.00043493477278389037, "19": 0.00041640488780103624}}, {"key": "kushman2013using", "year": "2013", "title": "Using Semantic Unification to Generate Regular Expressions from Natural Language", "topic_distr": {"0": 0.0029683327302336693, "1": 0.19556277990341187, "2": 0.002049611182883382, "3": 0.15922212600708008, "4": 0.0015656655887141824, "5": 0.0014003728283569217, "6": 0.0012666488764807582, "7": 0.0011562376748770475, "8": 0.0010635320795699954, "9": 0.0009845889871940017, "10": 0.0009165555238723755, "11": 0.0008573164814151824, "12": 0.0008052699267864227, "13": 0.0007591811008751392, "14": 0.031125344336032867, "15": 0.00068120495416224, "16": 0.0006479302537627518, "17": 0.0006177548784762621, "18": 0.0005902650882489979, "19": 0.5957592129707336}}, {"key": "lachaux2020unsupervised", "year": "2020", "title": "Unsupervised Translation of Programming Languages", "topic_distr": {"0": 0.0014853295870125294, "1": 0.0012123374035581946, "2": 0.0010248571634292603, "3": 0.000887624395545572, "4": 0.0007828525849618018, "5": 0.000700203818269074, "6": 0.0006333401543088257, "7": 0.0005781332729384303, "8": 0.0005317792529240251, "9": 0.0004923067172057927, "10": 0.0004582891706377268, "11": 0.3398062586784363, "12": 0.11032518744468689, "13": 0.0003795999800786376, "14": 0.5391507744789124, "15": 0.0003406109462957829, "16": 0.0003239731886424124, "17": 0.0003088851226493716, "18": 0.00029513987828977406, "19": 0.0002825658011715859}}, {"key": "lacomis2019neural", "year": "2019", "title": "A Neural Approach to Decompiled Identifier Renaming", "topic_distr": {"0": 0.0022281890269368887, "1": 0.001818640623241663, "2": 0.11366790533065796, "3": 0.0013314379611983895, "4": 0.0011742758797481656, "5": 0.0010503020603209734, "6": 0.000950006942730397, "7": 0.000867196882609278, "8": 0.1818341463804245, "9": 0.16151171922683716, "10": 0.0006874313694424927, "11": 0.0006430011708289385, "12": 0.0006039654253982008, "13": 0.5287664532661438, "14": 0.0005385733675211668, "15": 0.0005109146586619318, "16": 0.0004859581240452826, "17": 0.0004633260832633823, "18": 0.0004427082894835621, "19": 0.00042384726111777127}}, {"key": "lanchantin2018exploring", "year": "2018", "title": "Exploring the Naturalness of Buggy Code with Recurrent Neural Network", "topic_distr": {"0": 0.003283912315964699, "1": 0.35727205872535706, "2": 0.002265360439196229, "3": 0.0019620866514742374, "4": 0.0017304900102317333, "5": 0.2685287594795227, "6": 0.0013999921502545476, "7": 0.0012779578100889921, "8": 0.0011754927691072226, "9": 0.0010882391361519694, "10": 0.0010130436858162284, "11": 0.0009475683909840882, "12": 0.2545038163661957, "13": 0.09932869672775269, "14": 0.0007936767651699483, "15": 0.0007529171416535974, "16": 0.0007161395042203367, "17": 0.0006827875040471554, "18": 0.0006524037453345954, "19": 0.0006246088887564838}}, {"key": "leclair2019neural", "year": "2019", "title": "A Neural Model for Generating Natural Language Summaries of Program Subroutines", "topic_distr": {"0": 0.0018349973252043128, "1": 0.2485591471195221, "2": 0.001265999278984964, "3": 0.0010964730754494667, "4": 0.0009670444414950907, "5": 0.0008649491355754435, "6": 0.0007823535706847906, "7": 0.0007141574751585722, "8": 0.0006568972603417933, "9": 0.0006081375759094954, "10": 0.000566116243135184, "11": 0.0005295269074849784, "12": 0.0992889478802681, "13": 0.12778620421886444, "14": 0.04932467266917229, "15": 0.00042075052624568343, "16": 0.0004001981869805604, "17": 0.00038156018126755953, "18": 0.07670004665851593, "19": 0.3872518241405487}}, {"key": "leclair2019recommendations", "year": "2019", "title": "Recommendations for Datasets for Source Code Summarization", "topic_distr": {"0": 0.0019497096072882414, "1": 0.001591252046637237, "2": 0.001345073920674622, "3": 0.0011649985099211335, "4": 0.001027478720061481, "5": 0.0009190042619593441, "6": 0.0008312471327371895, "7": 0.000758789072278887, "8": 0.00069795036688447, "9": 0.0006461433949880302, "10": 0.0006014959653839469, "11": 0.0005626199417747557, "12": 0.0005284640355966985, "13": 0.000498217879794538, "14": 0.000471246603410691, "15": 0.0004470455169212073, "16": 0.00042520876741036773, "17": 0.000405405939090997, "18": 0.0003873655805364251, "19": 0.9847412705421448}}, {"key": "leclair2020improved", "year": "2020", "title": "Improved Code Summarization via a Graph Neural Network", "topic_distr": {"0": 0.10554533451795578, "1": 0.2993769943714142, "2": 0.0010009667603299022, "3": 0.0008669738308526576, "4": 0.0007646360900253057, "5": 0.0006839103298261762, "6": 0.0006186026730574667, "7": 0.0005646803765557706, "8": 0.0005194050609134138, "9": 0.00048085101298056543, "10": 0.00044762500328943133, "11": 0.00041869402048178017, "12": 0.27733513712882996, "13": 0.000370766909327358, "14": 0.00035069521982222795, "15": 0.0003326851292513311, "16": 0.00031643451075069606, "17": 0.00030169752426445484, "18": 0.0002882721310015768, "19": 0.30941566824913025}}, {"key": "lee2020montage", "year": "2020", "title": "Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer", "topic_distr": {"0": 0.002353051444515586, "1": 0.001921465969644487, "2": 0.0016241988632827997, "3": 0.0014067719457671046, "4": 0.0012407166650518775, "5": 0.7413252592086792, "6": 0.0010037600295618176, "7": 0.0009162644855678082, "8": 0.0008427995489910245, "9": 0.0007802408072166145, "10": 0.000726327474694699, "11": 0.0006793833454139531, "12": 0.24155066907405853, "13": 0.000601615640334785, "14": 0.0005690468242391944, "15": 0.0005398231442086399, "16": 0.00051345449173823, "17": 0.0004895419115200639, "18": 0.0004677575489040464, "19": 0.00044782928307540715}}, {"key": "lee2021cotraining", "year": "2021", "title": "Co-Training for Commit Classification", "topic_distr": {"0": 0.003284457139670849, "1": 0.0026798141188919544, "2": 0.002265462651848793, "3": 0.0019621073734015226, "4": 0.0017304993234574795, "5": 0.11170332133769989, "6": 0.29845449328422546, "7": 0.0012779650278389454, "8": 0.0011754994047805667, "9": 0.03150829300284386, "10": 0.3488190472126007, "11": 0.18918734788894653, "12": 0.0008900477550923824, "13": 0.0008391067385673523, "14": 0.0007936812471598387, "15": 0.000752921390812844, "16": 0.0007161435787566006, "17": 0.0006827913457527757, "18": 0.000652407412417233, "19": 0.0006246123812161386}}, {"key": "levy2017learning", "year": "2017", "title": "Learning to Align the Source Code to the Compiled Object Code", "topic_distr": {"0": 0.00311731593683362, "1": 0.22818100452423096, "2": 0.002152057131752372, "3": 0.0018639726331457496, "4": 0.0016439472092315555, "5": 0.0014703893102705479, "6": 0.0013299793936312199, "7": 0.15275539457798004, "8": 0.0011167071061208844, "9": 0.0010338169522583485, "10": 0.0009623819496482611, "11": 0.0009001810685731471, "12": 0.21020521223545074, "13": 0.0007971390150487423, "14": 0.10860105603933334, "15": 0.00071526417741552, "16": 0.0006803258438594639, "17": 0.28126072883605957, "18": 0.0006197774200700223, "19": 0.0005933725624345243}}, {"key": "lherondelle2022topical", "year": "2022", "title": "Topical: Learning Repository Embeddings from Source Code using Attention", "topic_distr": {"0": 0.1320033073425293, "1": 0.0009699870715849102, "2": 0.12264136224985123, "3": 0.16635125875473022, "4": 0.0006262746755965054, "5": 0.0005601559532806277, "6": 0.3362923264503479, "7": 0.0004625008150469512, "8": 0.00042541808215901256, "9": 0.00039384045521728694, "10": 0.09285221993923187, "11": 0.0003429308417253196, "12": 0.14424654841423035, "13": 0.0003036762063857168, "14": 0.0002872365294024348, "15": 0.0002724853693507612, "16": 0.00025917531456798315, "17": 0.0002471049956511706, "18": 0.00023610895732417703, "19": 0.00022604981495533139}}, {"key": "li2016gated", "year": "2016", "title": "Gated Graph Sequence Neural Networks", "topic_distr": {"0": 0.002312674652785063, "1": 0.572661817073822, "2": 0.0015942121390253305, "3": 0.0013807296054437757, "4": 0.0012177515309304, "5": 0.00108918861951679, "6": 0.0009851801441982388, "7": 0.0008993041701614857, "8": 0.0008271990809589624, "9": 0.000765798322390765, "10": 0.0007128829602152109, "11": 0.0006668077548965812, "12": 0.3481697142124176, "13": 0.0005904795252718031, "14": 0.0005585136241279542, "15": 0.0005298308678902686, "16": 0.000503950344864279, "17": 0.06363532692193985, "18": 0.00045909921755082905, "19": 0.0004395398427732289}}, {"key": "li2017code", "year": "2017", "title": "Code Completion with Neural Attention and Pointer Networks", "topic_distr": {"0": 0.0018076787237077951, "1": 0.5376569628715515, "2": 0.0012475763214752078, "3": 0.0010805604979395866, "4": 0.0009530144743621349, "5": 0.0008523994474671781, "6": 0.0007710023201070726, "7": 0.0007037956966087222, "8": 0.0006473662797361612, "9": 0.0005993140512146056, "10": 0.4498807489871979, "11": 0.000521843961905688, "12": 0.0004901635111309588, "13": 0.0004621094558387995, "14": 0.00043709290912374854, "15": 0.0004146458231844008, "16": 0.0003943916817661375, "17": 0.0003760240797419101, "18": 0.00035929118166677654, "19": 0.00034398402203805745}}, {"key": "li2017software", "year": "2017", "title": "Software Defect Prediction via Convolutional Neural Network", "topic_distr": {"0": 0.0011988584883511066, "1": 0.12879876792430878, "2": 0.06942728906869888, "3": 0.14759649336338043, "4": 0.000632291950751096, "5": 0.4455477297306061, "6": 0.000511534046381712, "7": 0.00046694473712705076, "8": 0.0004295057151466608, "9": 0.0003976246516685933, "10": 0.0003701494715642184, "11": 0.00034622589009813964, "12": 0.20242710411548615, "13": 0.00030659406911581755, "14": 0.0002899964165408164, "15": 0.0002751035208348185, "16": 0.00026166558382101357, "17": 0.0002494793152436614, "18": 0.00023837760090827942, "19": 0.00022822180471848696}}, {"key": "li2019improving", "year": "2019", "title": "Improving Bug Detection via Context-Based Code Representation Learning and Attention-Based Neural Networks", "topic_distr": {"0": 0.0010670014889910817, "1": 0.0836343765258789, "2": 0.0007357848808169365, "3": 0.05316726118326187, "4": 0.0005620502633973956, "5": 0.347025066614151, "6": 0.0004547069256659597, "7": 0.00041507111745886505, "8": 0.00038179123657755554, "9": 0.0003534519055392593, "10": 0.1684626042842865, "11": 0.0003077631117776036, "12": 0.17488913238048553, "13": 0.12588797509670258, "14": 0.0002577802515588701, "15": 0.04152901843190193, "16": 0.0002325967507204041, "17": 0.00022176426136866212, "18": 0.00021189585095271468, "19": 0.00020286827930249274}}, {"key": "li2019neural", "year": "2019", "title": "Neural Code Search Evaluation Dataset", "topic_distr": {"0": 0.5886807441711426, "1": 0.3904097080230713, "2": 0.002326550427824259, "3": 0.002015081699937582, "4": 0.0017772240098565817, "5": 0.0015895961550995708, "6": 0.0014378032647073269, "7": 0.0013124729739502072, "8": 0.0012072406243532896, "9": 0.0011176303960382938, "10": 0.001040404080413282, "11": 0.0009731603786349297, "12": 0.000914081116206944, "13": 0.0008617645944468677, "14": 0.0008151124930009246, "15": 0.0007732519879937172, "16": 0.0007354810950346291, "17": 0.000701228273101151, "18": 0.0006700239609926939, "19": 0.0006414784002117813}}, {"key": "li2019using", "year": "2019", "title": "Using GGNN to recommend log statement level", "topic_distr": {"0": 0.07626023888587952, "1": 0.0017262090696021914, "2": 0.0014590704813599586, "3": 0.0012637353502213955, "4": 0.0011145597090944648, "5": 0.0009968919912353158, "6": 0.3721310794353485, "7": 0.0008230980020016432, "8": 0.0007571030873805285, "9": 0.0007009053369984031, "10": 0.000652474001981318, "11": 0.0356498546898365, "12": 0.18876853585243225, "13": 0.0005404428811743855, "14": 0.0005111857317388058, "15": 0.0429854616522789, "16": 0.15583623945713043, "17": 0.00043976493179798126, "18": 0.11698084324598312, "19": 0.0004022936918772757}}, {"key": "li2020dlfix", "year": "2020", "title": "DLFix: Context-based Code Transformation Learning for Automated Program Repair", "topic_distr": {"0": 0.0015794092323631048, "1": 0.0012890164507552981, "2": 0.001089670928195119, "3": 0.07064329087734222, "4": 0.0008323861984536052, "5": 0.000744508346542716, "6": 0.0006734139169566333, "7": 0.000614713819231838, "8": 0.13332410156726837, "9": 0.0005234567797742784, "10": 0.26492032408714294, "11": 0.0004557923530228436, "12": 0.0004281218280084431, "13": 0.00040361867286264896, "14": 0.00038176856469362974, "15": 0.5208092927932739, "16": 0.0003444721514824778, "17": 0.0003284294216427952, "18": 0.0003138144384138286, "19": 0.0003004447789862752}}, {"key": "li2020learning", "year": "2020", "title": "Learning Code-Query Interaction for Enhancing Code Searches", "topic_distr": {"0": 0.0016207105945795774, "1": 0.40752217173576355, "2": 0.3552994728088379, "3": 0.16437456011772156, "4": 0.0008540095877833664, "5": 0.0007638483075425029, "6": 0.0006909071234986186, "7": 0.000630682276096195, "8": 0.0005801149527542293, "9": 0.06375876069068909, "10": 0.0004999450175091624, "11": 0.00046763248974457383, "12": 0.0004392431292217225, "13": 0.00041410347330383956, "14": 0.00039168575312942266, "15": 0.0003715705533977598, "16": 0.00035342053161002696, "17": 0.00033696103491820395, "18": 0.0003219664213247597, "19": 0.00030824943678453565}}, {"key": "li2021learning", "year": "2021", "title": "Learning to Extend Program Graphs to Work-in-Progress Code", "topic_distr": {"0": 0.2247159630060196, "1": 0.002680023666471243, "2": 0.0022653394844383, "3": 0.001962100388482213, "4": 0.00173049489967525, "5": 0.0015477992128580809, "6": 0.0013999970396980643, "7": 0.0012779623502865434, "8": 0.0011754969600588083, "9": 0.0010882429778575897, "10": 0.181916743516922, "11": 0.0009475717088207603, "12": 0.46827706694602966, "13": 0.0008391049923375249, "14": 0.0007936795591376722, "15": 0.10470642894506454, "16": 0.0007161420653574169, "17": 0.0006827898905612528, "18": 0.000652406073641032, "19": 0.0006246111006475985}}, {"key": "li2021toward", "year": "2021", "title": "Toward Less Hidden Cost of Code Completion with Acceptance and Ranking Models", "topic_distr": {"0": 0.0012605927186086774, "1": 0.0010288580087944865, "2": 0.0008695813012309372, "3": 0.000753142056055367, "4": 0.0006642374792136252, "5": 0.0005941138369962573, "6": 0.0005373780149966478, "7": 0.0004905359237454832, "8": 0.0004512053565122187, "9": 0.0004177135997451842, "10": 0.9902844429016113, "11": 0.0003637180489022285, "12": 0.00034163720556534827, "13": 0.00032208391348831356, "14": 0.0003046477504540235, "15": 0.00028900240431539714, "16": 0.000274885562248528, "17": 0.00026208360213786364, "18": 0.0002504209987819195, "19": 0.00023975211661309004}}, {"key": "li2022codereviewer", "year": "2022", "title": "CodeReviewer: Pre-Training for Automating Code Review Activities", "topic_distr": {"0": 0.001783472136594355, "1": 0.0014548880280926824, "2": 0.0012297708308324218, "3": 0.001065139309503138, "4": 0.0009394089574925601, "5": 0.0008402311941608787, "6": 0.000759995891712606, "7": 0.0006937487050890923, "8": 0.0006381248240359128, "9": 0.0005907585145905614, "10": 0.0005499381222762167, "11": 0.15467019379138947, "12": 0.00048316619358956814, "13": 0.00045551263610832393, "14": 0.0004308531933929771, "15": 0.000408726540626958, "16": 0.00038876154576428235, "17": 0.0003706561401486397, "18": 0.03636963665485382, "19": 0.7958769798278809}}, {"key": "li2022exploring", "year": "2022", "title": "Exploring Representation-Level Augmentation for Code Search", "topic_distr": {"0": 0.0016859080642461777, "1": 0.08755317330360413, "2": 0.8592849969863892, "3": 0.0010075492318719625, "4": 0.0008886123541742563, "5": 0.0007947975536808372, "6": 0.0007189010502770543, "7": 0.0006562359631061554, "8": 0.0006036197883076966, "9": 0.0005588147323578596, "10": 0.0005202015745453537, "11": 0.0004865798109676689, "12": 0.00045704017975367606, "13": 0.00043088194797746837, "14": 0.0004075558972544968, "15": 0.00038662567385472357, "16": 0.00036774025647901, "17": 0.00035061384551227093, "18": 0.04251941666007042, "19": 0.0003207389381714165}}, {"key": "li2023hitchhiker", "year": "2023", "title": "The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models", "topic_distr": {"0": 0.4136713445186615, "1": 0.0011316037271171808, "2": 0.0009564990177750587, "3": 0.000828456599265337, "4": 0.0007306665065698326, "5": 0.5149157047271729, "6": 0.0005911209736950696, "7": 0.0005395942134782672, "8": 0.0004963302635587752, "9": 0.0004594890051521361, "10": 0.0004277390835341066, "11": 0.00040009335498325527, "12": 0.00037580422940663993, "13": 0.00035429542185738683, "14": 0.00033511544461362064, "15": 0.06265628337860107, "16": 0.00030237677856348455, "17": 0.00028829448274336755, "18": 0.00027546551427803934, "19": 0.0002637296565808356}}, {"key": "li2023rethinking", "year": "2023", "title": "Rethinking Negative Pairs in Code Search", "topic_distr": {"0": 0.18841397762298584, "1": 0.5162384510040283, "2": 0.18695813417434692, "3": 0.07648398727178574, "4": 0.0009008035412989557, "5": 0.0008057018276304007, "6": 0.0007287640473805368, "7": 0.000665239233057946, "8": 0.0006119012250564992, "9": 0.0005664814379997551, "10": 0.0005273385322652757, "11": 0.0004932554438710213, "12": 0.00046331060002557933, "13": 0.0004367934598121792, "14": 0.00041314741247333586, "15": 0.00039193002157844603, "16": 0.00037278549280017614, "17": 0.023863228037953377, "18": 0.00033960791188292205, "19": 0.0003251393500249833}}, {"key": "li2023starcoder", "year": "2023", "title": "StarCoder: may the source be with you!", "topic_distr": {"0": 0.8572962284088135, "1": 0.0018185945227742195, "2": 0.0015372047200798988, "3": 0.0013314265524968505, "4": 0.0011742698261514306, "5": 0.0010502974037081003, "6": 0.0009500026935711503, "7": 0.0008671929826959968, "8": 0.0007976625929586589, "9": 0.04108086973428726, "10": 0.08741456270217896, "11": 0.0006429982604458928, "12": 0.0006039626896381378, "13": 0.0005693954881280661, "14": 0.0005385709227994084, "15": 0.0005109123885631561, "16": 0.00048595594125799835, "17": 0.0004633240168914199, "18": 0.00044270631042309105, "19": 0.0004238453402649611}}, {"key": "li2023think", "year": "2023", "title": "Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation", "topic_distr": {"0": 0.34259748458862305, "1": 0.0016425985377281904, "2": 0.0013884259387850761, "3": 0.0012025776086375117, "4": 0.16869105398654938, "5": 0.000948654895182699, "6": 0.0008580663707107306, "7": 0.0007832705159671605, "8": 0.0007204689318314195, "9": 0.0006669904687441885, "10": 0.0006209025159478188, "11": 0.4762316644191742, "12": 0.0005455143400467932, "13": 0.0005142923328094184, "14": 0.00048645082279108465, "15": 0.0004614689096342772, "16": 0.0004389276436995715, "17": 0.00041848589899018407, "18": 0.00039986346382647753, "19": 0.0003828277694992721}}, {"key": "li2024rewriting", "year": "2024", "title": "Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search", "topic_distr": {"0": 0.338958203792572, "1": 0.34573909640312195, "2": 0.09387879073619843, "3": 0.001202583429403603, "4": 0.0010606329888105392, "5": 0.0009486568160355091, "6": 0.0008580680005252361, "7": 0.0007832720293663442, "8": 0.0007204702706076205, "9": 0.0006669917493127286, "10": 0.0006209037383086979, "11": 0.0005807733396068215, "12": 0.0005455153295770288, "13": 0.000514293322339654, "14": 0.00048645175411365926, "15": 0.0004614698118530214, "16": 0.0004389284586068243, "17": 0.00041848671389743686, "18": 0.00039986424962989986, "19": 0.21071656048297882}}, {"key": "liguori2021shellcode_ia32", "year": "2021", "title": "Shellcode_IA32: A Dataset for Automatic Shellcode Generation", "topic_distr": {"0": 0.00346619775518775, "1": 0.0028288187459111214, "2": 0.0023911993484944105, "3": 0.002071059076115489, "4": 0.001826598308980465, "5": 0.03906843811273575, "6": 0.17536847293376923, "7": 0.0013489346019923687, "8": 0.0012407787144184113, "9": 0.0011486790608614683, "10": 0.001069307210855186, "11": 0.0010001955088227987, "12": 0.0009394750231876969, "13": 0.0008857050561346114, "14": 0.1580965518951416, "15": 0.0007947335252538323, "16": 0.0007559133227914572, "17": 0.0007207089802250266, "18": 0.0006886377232149243, "19": 0.6042895317077637}}, {"key": "lin2017program", "year": "2017", "title": "Program Synthesis from Natural Language Using Recurrent Neural Networks", "topic_distr": {"0": 0.0017825402319431305, "1": 0.1878463625907898, "2": 0.0012298035435378551, "3": 0.0010651522316038609, "4": 0.11467429995536804, "5": 0.0008402450475841761, "6": 0.0007600085809826851, "7": 0.0006937602302059531, "8": 0.07257568091154099, "9": 0.0005907683516852558, "10": 0.0005499472608789802, "11": 0.0005144029855728149, "12": 0.05319686979055405, "13": 0.00045552023220807314, "14": 0.05398821458220482, "15": 0.0004087333509232849, "16": 0.0003887680359184742, "17": 0.00037066233926452696, "18": 0.30938631296157837, "19": 0.19868192076683044}}, {"key": "lin2018nl2bash", "year": "2018", "title": "NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System", "topic_distr": {"0": 0.003465461079031229, "1": 0.002828798955306411, "2": 0.002391279675066471, "3": 0.0020710777025669813, "4": 0.0018266079714521766, "5": 0.0016337635461241007, "6": 0.4164317846298218, "7": 0.0013489405391737819, "8": 0.0012407841859385371, "9": 0.0011486841831356287, "10": 0.0010693119838833809, "11": 0.00100020004902035, "12": 0.0009394792141392827, "13": 0.0008857090142555535, "14": 0.0008377606864087284, "15": 0.0007947370759211481, "16": 0.0007559166988357902, "17": 0.0007207121816463768, "18": 0.08788540959358215, "19": 0.47072362899780273}}, {"key": "lin2019impact", "year": "2019", "title": "On the Impact of Refactoring Operations on Code Naturalness", "topic_distr": {"0": 0.8685792088508606, "1": 0.002483788877725601, "2": 0.002099613891914487, "3": 0.0018185609951615334, "4": 0.0016038985922932625, "5": 0.0014345694798976183, "6": 0.0012975798454135656, "7": 0.001184472581371665, "8": 0.0010895030573010445, "9": 0.11107528954744339, "10": 0.0009389374172315001, "11": 0.0008782517979852855, "12": 0.0008249343372881413, "13": 0.000777720008045435, "14": 0.000735617708414793, "15": 0.0006978397141210735, "16": 0.0006637524929828942, "17": 0.00063284020870924, "18": 0.0006046791095286608, "19": 0.0005789175047539175}}, {"key": "ling2016latent", "year": "2016", "title": "Latent Predictor Networks for Code Generation", "topic_distr": {"0": 0.002546637551859021, "1": 0.22569772601127625, "2": 0.0017568116309121251, "3": 0.0015216313768178225, "4": 0.32185670733451843, "5": 0.0012003418523818254, "6": 0.0010857192100957036, "7": 0.000991079374216497, "8": 0.0009116158471442759, "9": 0.0008439490920864046, "10": 0.0007856336305849254, "11": 0.0007348563522100449, "12": 0.000690244254656136, "13": 0.0006507387734018266, "14": 0.0006155106821097434, "15": 0.0005839008372277021, "16": 0.000555379141587764, "17": 0.0005295140435919166, "18": 0.000505950883962214, "19": 0.4359360337257385}}, {"key": "ling2020adaptive", "year": "2020", "title": "Adaptive Deep Code Search", "topic_distr": {"0": 0.0016009428072720766, "1": 0.23049037158489227, "2": 0.5474363565444946, "3": 0.0009558852761983871, "4": 0.000843048794195056, "5": 0.0007540440419688821, "6": 0.0006820391281507909, "7": 0.0006225872784852982, "8": 0.2122298777103424, "9": 0.0005301613127812743, "10": 0.00049352808855474, "11": 0.00046163026127032936, "12": 0.00043360530980862677, "13": 0.00040878832805901766, "14": 0.00038665832835249603, "15": 0.00036680130870081484, "16": 0.0003488842339720577, "17": 0.0003326360019855201, "18": 0.0003178338520228863, "19": 0.00030429294565692544}}, {"key": "ling2020deep", "year": "2020", "title": "Deep Graph Matching and Searching for Semantic Code Retrieval", "topic_distr": {"0": 0.17145511507987976, "1": 0.20634248852729797, "2": 0.0008695300784893334, "3": 0.09772947430610657, "4": 0.0006642254302278161, "5": 0.0005941003910265863, "6": 0.06667537242174149, "7": 0.0004905274836346507, "8": 0.0004511976439971477, "9": 0.0004177064693067223, "10": 0.0003888436476700008, "11": 0.00036371182068251073, "12": 0.15501366555690765, "13": 0.00032207841286435723, "14": 0.00030464251176454127, "15": 0.0002889974566642195, "16": 0.000274880847427994, "17": 0.0002620791201479733, "18": 0.00025041672051884234, "19": 0.29684096574783325}}, {"key": "liu2016towards", "year": "2016", "title": "Towards Better Program Obfuscation: Optimization via Language Models", "topic_distr": {"0": 0.15032878518104553, "1": 0.002425215905532241, "2": 0.002049637958407402, "3": 0.0017752446001395583, "4": 0.0015656912000849843, "5": 0.0014003947144374251, "6": 0.16689874231815338, "7": 0.0011562558356672525, "8": 0.0010635487269610167, "9": 0.0009846043540164828, "10": 0.0009165698429569602, "11": 0.0008573298691771924, "12": 0.000805282557848841, "13": 0.0007591929752379656, "14": 0.0007180936518125236, "15": 0.0006812156061641872, "16": 0.0006479403818957508, "17": 0.2433803230524063, "18": 0.4210208058357239, "19": 0.0005651263636536896}}, {"key": "liu2018neural", "year": "2018", "title": "Neural-Machine-Translation-Based Commit Message Generation: How Far Are We?", "topic_distr": {"0": 0.12208010256290436, "1": 0.08948788046836853, "2": 0.0009256565244868398, "3": 0.0008017166401259601, "4": 0.0007070821593515575, "5": 0.0006324331043288112, "6": 0.0005720409099012613, "7": 0.0005221773171797395, "8": 0.00048030982725322247, "9": 0.00044465772225521505, "10": 0.0004139326047152281, "11": 0.4971276819705963, "12": 0.0003636741021182388, "13": 0.0003428595664445311, "14": 0.13365139067173004, "15": 0.00030764416442252696, "16": 0.15033793449401855, "17": 0.0002789889695122838, "18": 0.000266574090346694, "19": 0.0002552170481067151}}, {"key": "liu2019deepfuzz", "year": "2019", "title": "DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing", "topic_distr": {"0": 0.0018348157173022628, "1": 0.001497559598647058, "2": 0.1778886318206787, "3": 0.0010964598041027784, "4": 0.05421765521168709, "5": 0.5590091943740845, "6": 0.0007823471678420901, "7": 0.0007141516543924809, "8": 0.1461418718099594, "9": 0.05239567533135414, "10": 0.0005661116447299719, "11": 0.0005295226001180708, "12": 0.0004973759641870856, "13": 0.00046890912926755846, "14": 0.00044352447730489075, "15": 0.00042074709199368954, "16": 0.00040019492735154927, "17": 0.0003815570380538702, "18": 0.00036457795067690313, "19": 0.0003490455274004489}}, {"key": "liu2019generating", "year": "2019", "title": "Generating commit messages from diffs using pointer-generator network", "topic_distr": {"0": 0.0019187239231541753, "1": 0.21247434616088867, "2": 0.09088318049907684, "3": 0.0011470657773315907, "4": 0.0010116653284057975, "5": 0.0009048599167726934, "6": 0.0008184529724530876, "7": 0.0007471101707778871, "8": 0.000687207852024585, "9": 0.0006361982668749988, "10": 0.0005922380369156599, "11": 0.38796672224998474, "12": 0.0005203302134759724, "13": 0.159077450633049, "14": 0.1386098712682724, "15": 0.0004401648184284568, "16": 0.00041866415995173156, "17": 0.00039916616515256464, "18": 0.00038140345714055, "19": 0.00036515420651994646}}, {"key": "liu2019learning", "year": "2019", "title": "Learning to Sport and Refactor Inconsistent Method Names", "topic_distr": {"0": 0.0018345206044614315, "1": 0.0014975046506151557, "2": 0.5924496650695801, "3": 0.0010964764514937997, "4": 0.0009670373401604593, "5": 0.000864944071508944, "6": 0.0007823489140719175, "7": 0.0007141532259993255, "8": 0.0006568933604285121, "9": 0.0006081339088268578, "10": 0.000566112925298512, "11": 0.0005295237642712891, "12": 0.000497377070132643, "13": 0.39457571506500244, "14": 0.0004435254668351263, "15": 0.00042074802331626415, "16": 0.0004001958295702934, "17": 0.0003815579111687839, "18": 0.0003645787655841559, "19": 0.00034904631320387125}}, {"key": "liu2019neural", "year": "2019", "title": "Neural query expansion for code search", "topic_distr": {"0": 0.001559642143547535, "1": 0.9887694716453552, "2": 0.00107609445694834, "3": 0.0009319831733591855, "4": 0.0008219753508456051, "5": 0.0007351957610808313, "6": 0.0006649907445535064, "7": 0.0006070249364711344, "8": 0.0005583544261753559, "9": 0.000516909291036427, "10": 0.00048119176062755287, "11": 0.00045009126188233495, "12": 0.0004227668105158955, "13": 0.0003985701478086412, "14": 0.00037699335371144116, "15": 0.0003576326707843691, "16": 0.00034016347490251064, "17": 0.0003243213868699968, "18": 0.0003098892339039594, "19": 0.0002966867759823799}}, {"key": "liu2020automating", "year": "2020", "title": "Automating Just-In-Time Comment Updating", "topic_distr": {"0": 0.001521967351436615, "1": 0.0012419286649674177, "2": 0.001049837446771562, "3": 0.0009092579130083323, "4": 0.000801931950263679, "5": 0.0007172685000114143, "6": 0.0006487751961685717, "7": 0.0005922229029238224, "8": 0.0005447392468340695, "9": 0.0005043047131039202, "10": 0.6353570222854614, "11": 0.08766051381826401, "12": 0.00041245785541832447, "13": 0.00038885121466591954, "14": 0.0003678005305118859, "15": 0.07332923263311386, "16": 0.00033186873770318925, "17": 0.0003164129448123276, "18": 0.00030233271536417305, "19": 0.19300128519535065}}, {"key": "liu2022open", "year": "2022", "title": "Open-ended Knowledge Tracing", "topic_distr": {"0": 0.2540387809276581, "1": 0.0018857457907870412, "2": 0.21010184288024902, "3": 0.0013807372888550162, "4": 0.11849276721477509, "5": 0.0010891933925449848, "6": 0.00098518468439579, "7": 0.14882272481918335, "8": 0.0008272028644569218, "9": 0.07693584263324738, "10": 0.0007128861616365612, "11": 0.0006668107816949487, "12": 0.0006263295654207468, "13": 0.0005904822028242052, "14": 0.0005585161270573735, "15": 0.000529833254404366, "16": 0.0005039526149630547, "17": 0.0004804825293831527, "18": 0.18033112585544586, "19": 0.00043954182183369994}}, {"key": "liu2023code", "year": "2023", "title": "Code Execution with Pre-trained Language Models", "topic_distr": {"0": 0.30073168873786926, "1": 0.03837064281105995, "2": 0.0016554509056732059, "3": 0.0014338470064103603, "4": 0.1983105093240738, "5": 0.0011310793925076723, "6": 0.0010230705374851823, "7": 0.0009338917443528771, "8": 0.06646306812763214, "9": 0.0007952512823976576, "10": 0.0007403007475659251, "11": 0.0006924534682184458, "12": 0.06150174140930176, "13": 0.0006131895934231579, "14": 0.0005799942882731557, "15": 0.3230683505535126, "16": 0.0005233324482105672, "17": 0.000498959852848202, "18": 0.00047675633686594665, "19": 0.0004564447153825313}}, {"key": "lomshakov2023fine", "year": "2023", "title": "Fine-Tuning Large Language Models for Answering Programming Questions with Code Snippets", "topic_distr": {"0": 0.6579574346542358, "1": 0.0022629655431956053, "2": 0.11062072217464447, "3": 0.0016568897990509868, "4": 0.07355823367834091, "5": 0.0013070375425741076, "6": 0.001182226580567658, "7": 0.0010791743407025933, "8": 0.0009926476050168276, "9": 0.000918966019526124, "10": 0.0008554669911973178, "11": 0.000800176290795207, "12": 0.0007515986217185855, "13": 0.0007085815886966884, "14": 0.0006702221580781043, "15": 0.0006358025711961091, "16": 0.0006047456408850849, "17": 0.0005765814566984773, "18": 0.14233307540416718, "19": 0.0005274523864500225}}, {"key": "louis2018deep", "year": "2018", "title": "Deep Learning to Detect Redundant Method Comments", "topic_distr": {"0": 0.001807291293516755, "1": 0.0014757926110178232, "2": 0.0012475773692131042, "3": 0.001080561662092805, "4": 0.0009530111565254629, "5": 0.0008523978176526725, "6": 0.0007710008649155498, "7": 0.0007037944160401821, "8": 0.000647365057375282, "9": 0.0005993128870613873, "10": 0.0005579013959504664, "11": 0.0005218429723754525, "12": 0.0004901625798083842, "13": 0.00046210861182771623, "14": 0.00043709209421649575, "15": 0.00041464503738097847, "16": 0.9858988523483276, "17": 0.00037602338124997914, "18": 0.00035929051227867603, "19": 0.0003439833817537874}}, {"key": "louis2020where", "year": "2020", "title": "Where should I comment my code? A dataset and model for predicting locations that need comments", "topic_distr": {"0": 0.0033741514198482037, "1": 0.0027520463336259127, "2": 0.002326604910194874, "3": 0.0020151513163000345, "4": 0.0017772821011021733, "5": 0.0015896448167040944, "6": 0.0014378466876223683, "7": 0.0013125126715749502, "8": 0.0012072770623490214, "9": 0.0011176641564816236, "10": 0.29408976435661316, "11": 0.0009731898317113519, "12": 0.0009141087648458779, "13": 0.000861790613271296, "14": 0.000815137114841491, "15": 0.0007732753874734044, "16": 0.000735503330361098, "17": 0.0007012494606897235, "18": 0.48643139004707336, "19": 0.19479435682296753}}, {"key": "loyola2017neural", "year": "2017", "title": "A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes", "topic_distr": {"0": 0.0033723236992955208, "1": 0.0027523566968739033, "2": 0.002326550194993615, "3": 0.0020151089411228895, "4": 0.0017772451974451542, "5": 0.001589614199474454, "6": 0.0014378194464370608, "7": 0.0013124877586960793, "8": 0.0012072542449459434, "9": 0.0011176429688930511, "10": 0.0010404157219454646, "11": 0.3576403558254242, "12": 0.0009140914189629257, "13": 0.0008617742569185793, "14": 0.2845243811607361, "15": 0.0007732607191428542, "16": 0.0007354893605224788, "17": 0.0007012361893430352, "18": 0.06705304980278015, "19": 0.26684755086898804}}, {"key": "loyola2018content", "year": "2018", "title": "Content Aware Source Code Change Description Generation", "topic_distr": {"0": 0.002710657427087426, "1": 0.0022141668014228344, "2": 0.0018714305479079485, "3": 0.0016208798624575138, "4": 0.0014295481378212571, "5": 0.0012786248698830605, "6": 0.0011565270833671093, "7": 0.0010557150235399604, "8": 0.0009710691520012915, "9": 0.033619508147239685, "10": 0.0008368706912733614, "11": 0.28359946608543396, "12": 0.0007352601969614625, "13": 0.0006931783282198012, "14": 0.1759849190711975, "15": 0.0006219813949428499, "16": 0.0005915995570831001, "17": 0.0005640476010739803, "18": 0.0005389477591961622, "19": 0.4879055917263031}}, {"key": "lu2019program", "year": "2019", "title": "Program Classification Using Gated Graph Attention Neural Network for Online Programming Service", "topic_distr": {"0": 0.0013418099842965603, "1": 0.0010951363947242498, "2": 0.0009256459889002144, "3": 0.0008017276413738728, "4": 0.0007070920546539128, "5": 0.0006324410205706954, "6": 0.0005720481858588755, "7": 0.11916300654411316, "8": 0.16137054562568665, "9": 0.00044466336839832366, "10": 0.00041393787250854075, "11": 0.00038718414725735784, "12": 0.48225855827331543, "13": 0.0003428639320190996, "14": 0.00032430278952233493, "15": 0.00030764806433580816, "16": 0.000292620446998626, "17": 0.0002789925201795995, "18": 0.22808457911014557, "19": 0.0002552202786318958}}, {"key": "lu2021codexglue", "year": "2021", "title": "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation", "topic_distr": {"0": 0.5292558073997498, "1": 0.002078078221529722, "2": 0.0017567769391462207, "3": 0.0015215895837172866, "4": 0.18907581269741058, "5": 0.0012003027368336916, "6": 0.0010856838198378682, "7": 0.0009910471271723509, "8": 0.0009115862776525319, "9": 0.0008439216762781143, "10": 0.0007856081356294453, "11": 0.0007348325452767313, "12": 0.0006902218447066844, "13": 0.000650717644020915, "14": 0.0006154906586743891, "15": 0.0005838818615302444, "16": 0.0005553610972128808, "17": 0.0005294968141242862, "18": 0.12084860354661942, "19": 0.14528514444828033}}, {"key": "lu2022reacc", "year": "2022", "title": "ReACC: A Retrieval-Augmented Code Completion Framework", "topic_distr": {"0": 0.0021892746444791555, "1": 0.0017866562120616436, "2": 0.001510221161879599, "3": 0.0013080654898658395, "4": 0.001153669785708189, "5": 0.0010318707209080458, "6": 0.0009333357447758317, "7": 0.0008519788971170783, "8": 0.0007836683071218431, "9": 0.0007254987140186131, "10": 0.6276405453681946, "11": 0.0006317174411378801, "12": 0.000593366741668433, "13": 0.0005594059475697577, "14": 0.03815346211194992, "15": 0.0005019488744437695, "16": 0.00047743029426783323, "17": 0.00045519540435634553, "18": 0.00043493942939676344, "19": 0.3182777762413025}}, {"key": "luan2019aroma", "year": "2015", "title": "Aroma: code recommendation via structural code search", "topic_distr": {"0": 0.0014673031400889158, "1": 0.9261366128921509, "2": 0.0010127630084753036, "3": 0.0008771759457886219, "4": 0.0007736333645880222, "5": 0.0006919574807398021, "6": 0.0006258813664317131, "7": 0.06386540085077286, "8": 0.0005255165160633624, "9": 0.0004865088558290154, "10": 0.000452891894383356, "11": 0.0004236204840708524, "12": 0.0003979030589107424, "13": 0.00037512945709750056, "14": 0.00035482161911204457, "15": 0.0003365995944477618, "16": 0.000320157763781026, "17": 0.00030524737667292356, "18": 0.00029166400781832635, "19": 0.00027923804009333253}}, {"key": "maddison2014structured", "year": "2014", "title": "Structured Generative Models of Natural Source Code", "topic_distr": {"0": 0.0035631919745355844, "1": 0.002909228904172778, "2": 0.0024594906717538834, "3": 0.0021302306558936834, "4": 0.0018787913722917438, "5": 0.0016804365441203117, "6": 0.972943127155304, "7": 0.0013874766882508993, "8": 0.0012762305559590459, "9": 0.001181499450467527, "10": 0.001099859829992056, "11": 0.0010287733748555183, "12": 0.0009663179516792297, "13": 0.0009110116516239941, "14": 0.0008616935228928924, "15": 0.0008174408576451242, "16": 0.0007775115082040429, "17": 0.0007413012208417058, "18": 0.0007083136588335037, "19": 0.0006781368283554912}}, {"key": "mahmud2021code", "year": "2021", "title": "Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors", "topic_distr": {"0": 0.3175026476383209, "1": 0.0016164887929335237, "2": 0.0013664138969033957, "3": 0.0011834832839667797, "4": 0.0010437832679599524, "5": 0.0009335868526250124, "6": 0.0008444370469078422, "7": 0.1002635657787323, "8": 0.00070902518928051, "9": 0.0006563961505889893, "10": 0.0006110402755439281, "11": 0.000571547425352037, "12": 0.0005368495476432145, "13": 0.0005061234696768224, "14": 0.15122635662555695, "15": 0.0004541390808299184, "16": 0.0004319558502174914, "17": 0.0004118387878406793, "18": 0.000393512164009735, "19": 0.4187368154525757}}, {"key": "malik2019nl2type", "year": "2019", "title": "NL2Type: Inferring JavaScript Function Types from Natural Language Information", "topic_distr": {"0": 0.0014332791324704885, "1": 0.001170871197246015, "2": 0.0009894439717754722, "3": 0.0008570046629756689, "4": 0.0007558446377515793, "5": 0.0006760472897440195, "6": 0.0006114904535934329, "7": 0.0005581881268881261, "8": 0.0005134333041496575, "9": 0.9889784455299377, "10": 0.00044247854384593666, "11": 0.0004138801887165755, "12": 0.00038875406607985497, "13": 0.0003665041003841907, "14": 0.0003466632042545825, "15": 0.00032886015833355486, "16": 0.00031279638642445207, "17": 0.00029822884243912995, "18": 0.00028495778678916395, "19": 0.0002728175022639334}}, {"key": "mammadli2020static", "year": "2020", "title": "Static Neural Compiler Optimization via Deep Reinforcement Learning", "topic_distr": {"0": 0.0018075573025271297, "1": 0.0014762500068172812, "2": 0.0012475880794227123, "3": 0.0010805597994476557, "4": 0.0009530140669085085, "5": 0.00085240084445104, "6": 0.0007710037170909345, "7": 0.0007037969771772623, "8": 0.0006473674438893795, "9": 0.000599315098952502, "10": 0.0005579034332185984, "11": 0.0005218448932282627, "12": 0.0004901643842458725, "13": 0.0004621102998498827, "14": 0.00043709369492717087, "15": 0.0004146465507801622, "16": 0.00039439238025806844, "17": 0.9858797788619995, "18": 0.0003592918219510466, "19": 0.00034398463321849704}}, {"key": "mangal2015user", "year": "2015", "title": "A User-Guided Approach to Program Analysis", "topic_distr": {"0": 0.0021506105549633503, "1": 0.0017556912498548627, "2": 0.001484145293943584, "3": 0.001285497099161148, "4": 0.0011337577598169446, "5": 0.001014062319882214, "6": 0.0009172277059406042, "7": 0.0008372748852707446, "8": 0.0007701432914473116, "9": 0.0007129776058718562, "10": 0.0006637120968662202, "11": 0.000620814913418144, "12": 0.000583126035053283, "13": 0.0005497513921000063, "14": 0.0005199902807362378, "15": 0.0004932859446853399, "16": 0.00046919050510041416, "17": 0.00044733937829732895, "18": 0.9831821918487549, "19": 0.0004092227027285844}}, {"key": "markovtsev2017topic", "year": "2017", "title": "Topic modeling of public repositories at scale using names in source code", "topic_distr": {"0": 0.0021166447550058365, "1": 0.0017260363092646003, "2": 0.001459021819755435, "3": 0.0012637190520763397, "4": 0.03668152168393135, "5": 0.0009968850063160062, "6": 0.20061227679252625, "7": 0.0008230922394432127, "8": 0.16443634033203125, "9": 0.0007009004475548863, "10": 0.3506229817867279, "11": 0.0006102988263592124, "12": 0.0005732484278269112, "13": 0.2346574366092682, "14": 0.0005111821228638291, "15": 0.00048493011854588985, "16": 0.0004612428310792893, "17": 0.00043976184679195285, "18": 0.0004201926349196583, "19": 0.00040229083970189095}}, {"key": "markovtsev2018public", "year": "2018", "title": "Public Git Archive: a Big Code dataset for all", "topic_distr": {"0": 0.8952091932296753, "1": 0.0017555919475853443, "2": 0.0014842043165117502, "3": 0.001285502570681274, "4": 0.001133767538703978, "5": 0.0010140715166926384, "6": 0.0009172362624667585, "7": 0.0008372827433049679, "8": 0.000770150450989604, "9": 0.0007129842997528613, "10": 0.0006637182668782771, "11": 0.0006208206759765744, "12": 0.0005831315065734088, "13": 0.0005497565143741667, "14": 0.0005199951701797545, "15": 0.0004932905430905521, "16": 0.00046919487067498267, "17": 0.00044734354014508426, "18": 0.00042743695667013526, "19": 0.09010534733533859}}, {"key": "markovtsev2019style", "year": "2019", "title": "STYLE-ANALYZER: fixing code style inconsistencies with interpretable unsupervised algorithms", "topic_distr": {"0": 0.0018630947452038527, "1": 0.0015200509224087, "2": 0.001284850761294365, "3": 0.0011128485202789307, "4": 0.000981488381512463, "5": 0.2169446498155594, "6": 0.0007940389914438128, "7": 0.0007248243200592697, "8": 0.14822477102279663, "9": 0.000617220823187381, "10": 0.057036854326725006, "11": 0.4302990436553955, "12": 0.0005048090242780745, "13": 0.00047591677866876125, "14": 0.00045015275827609, "15": 0.13564762473106384, "16": 0.0004061756480950862, "17": 0.00038725926424376667, "18": 0.0003700263914652169, "19": 0.0003542618651408702}}, {"key": "mastropaolo2022using", "year": "2022", "title": "Using Deep Learning to Generate Complete Log Statements", "topic_distr": {"0": 0.0021871651988476515, "1": 0.0017864654073491693, "2": 0.28263628482818604, "3": 0.0013080572243779898, "4": 0.00115364626981318, "5": 0.0010318511631339788, "6": 0.5688948035240173, "7": 0.0008519627735950053, "8": 0.0007836534059606493, "9": 0.0007254849188029766, "10": 0.0006753551424480975, "11": 0.13399752974510193, "12": 0.0005933554493822157, "13": 0.0005593952955678105, "14": 0.0005291121196933091, "15": 0.0005019393283873796, "16": 0.00047742121387273073, "17": 0.0004551867605186999, "18": 0.00043493116390891373, "19": 0.00041640145354904234}}, {"key": "mehrotra2020modeling", "year": "2020", "title": "Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks", "topic_distr": {"0": 0.1937921643257141, "1": 0.0009984254138544202, "2": 0.0008439529919996858, "3": 0.5138715505599976, "4": 0.0006446980405598879, "5": 0.000576634076423943, "6": 0.0005215703276917338, "7": 0.00047610612818971276, "8": 0.0004379325546324253, "9": 0.00040542599163018167, "10": 0.00037741175037808716, "11": 0.0003530187823344022, "12": 0.28481534123420715, "13": 0.00031260939431376755, "14": 0.0002956861280836165, "15": 0.00028050102991983294, "16": 0.0002667994413059205, "17": 0.0002543740556575358, "18": 0.0002430545282550156, "19": 0.00023269948724191636}}, {"key": "menon2013machine", "year": "2013", "title": "A Machine Learning Framework for Programming by Example", "topic_distr": {"0": 0.0031184281688183546, "1": 0.2731814682483673, "2": 0.0021521244198083878, "3": 0.0018640008056536317, "4": 0.0016439786413684487, "5": 0.1068313792347908, "6": 0.1500169336795807, "7": 0.0012140683829784393, "8": 0.0011167259654030204, "9": 0.0010338344145566225, "10": 0.0009623981895856559, "11": 0.0009001962607726455, "12": 0.0008455465431325138, "13": 0.0007971524610184133, "14": 0.0007539981743320823, "15": 0.00071527628460899, "16": 0.0006803373107686639, "17": 0.0006486526690423489, "18": 0.45093008875846863, "19": 0.0005933825741522014}}, {"key": "mesbah2019deepdelta", "year": "2019", "title": "DeepDelta: Learning to Repair Compilation Errors", "topic_distr": {"0": 0.002043680986389518, "1": 0.0016695251688361168, "2": 0.0014112088829278946, "3": 0.0012222909135743976, "4": 0.0010780051816254854, "5": 0.0009641966898925602, "6": 0.0008721238118596375, "7": 0.0007961026858538389, "8": 0.0007322722231037915, "9": 0.000677917618304491, "10": 0.0006310746539384127, "11": 0.10926378518342972, "12": 0.0005544513696804643, "13": 0.0005227178917266428, "14": 0.6942578554153442, "15": 0.1816358119249344, "16": 0.0004461185017134994, "17": 0.00042534185922704637, "18": 0.00040641435771249235, "19": 0.00038909955765120685}}, {"key": "mir2021manytypes4py", "year": "2021", "title": "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference", "topic_distr": {"0": 0.49114543199539185, "1": 0.0019581010565161705, "2": 0.0016557266935706139, "3": 0.0014338589971885085, "4": 0.0012646017130464315, "5": 0.0011310928966850042, "6": 0.0010230827610939741, "7": 0.0009339028038084507, "8": 0.0008590236539021134, "9": 0.3864838778972626, "10": 0.0007403094787150621, "11": 0.0006924616172909737, "12": 0.106979601085186, "13": 0.0006131968693807721, "14": 0.0005800010985694826, "15": 0.0005502148997038603, "16": 0.0005233386182226241, "17": 0.0004989657318219543, "18": 0.00047676198300905526, "19": 0.00045645012869499624}}, {"key": "mir2021type4py", "year": "2021", "title": "Type4Py: Deep Similarity Learning-Based Type Inference for Python", "topic_distr": {"0": 0.001732356264255941, "1": 0.07099796831607819, "2": 0.0011955926893278956, "3": 0.07282868027687073, "4": 0.000913306896109134, "5": 0.0008168843924067914, "6": 0.00073887879261747, "7": 0.0006744723650626838, "8": 0.0006203940138220787, "9": 0.8453055620193481, "10": 0.0005346576799638569, "11": 0.0005001015379093587, "12": 0.00046974103315733373, "13": 0.0004428558750078082, "14": 0.0004188816237729043, "15": 0.0003973697603214532, "16": 0.00037795951357111335, "17": 0.0003603571967687458, "18": 0.0003443214518483728, "19": 0.0003296520735602826}}, {"key": "mohajer2023skipanalyzer", "year": "2023", "title": "SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models", "topic_distr": {"0": 0.1275348663330078, "1": 0.0010950136929750443, "2": 0.0009255970362573862, "3": 0.0008017151849344373, "4": 0.04965602979063988, "5": 0.6322616338729858, "6": 0.0005720393965020776, "7": 0.0005221759784035385, "8": 0.00048030854668468237, "9": 0.1005566418170929, "10": 0.0004139315278735012, "11": 0.00038717821007594466, "12": 0.00036367314169183373, "13": 0.0003428586642257869, "14": 0.00032429781276732683, "15": 0.08266869932413101, "16": 0.00029261596500873566, "17": 0.0002789882419165224, "18": 0.00026657339185476303, "19": 0.0002552163787186146}}, {"key": "monperrus2021megadiff", "year": "2021", "title": "Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size", "topic_distr": {"0": 0.3342483639717102, "1": 0.004628300201147795, "2": 0.003912839572876692, "3": 0.0033889911137521267, "4": 0.002988968277350068, "5": 0.002673414070159197, "6": 0.27296438813209534, "7": 0.002207342302426696, "8": 0.0020303605124354362, "9": 0.0018796523800119758, "10": 0.001749771530739963, "11": 0.17269673943519592, "12": 0.0015373192727565765, "13": 0.0014493322232738137, "14": 0.0013708719052374363, "15": 0.18565131723880768, "16": 0.0012369463220238686, "17": 0.0011793392477557063, "18": 0.001126859220676124, "19": 0.0010788507061079144}}, {"key": "mou2014building", "year": "2014", "title": "Building Program Vector Representations for Deep Learning", "topic_distr": {"0": 0.0014507384039461613, "1": 0.001184189459308982, "2": 0.0010009559337049723, "3": 0.000866983609739691, "4": 0.0007646228768862784, "5": 0.0006838985136710107, "6": 0.0006185918464325368, "7": 0.0005646705976687372, "8": 0.0005193959805183113, "9": 0.0004808426310773939, "10": 0.0004476172325666994, "11": 0.000418686744524166, "12": 0.00039326882688328624, "13": 0.0003707604482769966, "14": 0.0003506891371216625, "15": 0.0003326793375890702, "16": 0.00031642901012673974, "17": 0.9886707067489624, "18": 0.0002882671251427382, "19": 0.0002759858325589448}}, {"key": "mou2016convolutional", "year": "2016", "title": "Convolutional Neural Networks over Tree Structures for Programming Language Processing", "topic_distr": {"0": 0.19159068167209625, "1": 0.0016165249980986118, "2": 0.0013664079597219825, "3": 0.001183482352644205, "4": 0.001043782220222056, "5": 0.0009335845243185759, "6": 0.0008444349514320493, "7": 0.0007708274060860276, "8": 0.07081744074821472, "9": 0.0006563945207744837, "10": 0.0006110387621447444, "11": 0.0005715459701605141, "12": 0.724940836429596, "13": 0.0005061221891082823, "14": 0.00047872299910523, "15": 0.00045413794578053057, "16": 0.0004319547733757645, "17": 0.0004118377692066133, "18": 0.00039351117447949946, "19": 0.0003767461166717112}}, {"key": "movshovitz2013natural", "year": "2013", "title": "Natural Language Models for Predicting Programming Comments", "topic_distr": {"0": 0.07766877859830856, "1": 0.0022135779727250338, "2": 0.001871325890533626, "3": 0.0016208544839173555, "4": 0.0014295339351519942, "5": 0.0012786120641976595, "6": 0.0011565150925889611, "7": 0.0010557041969150305, "8": 0.0009710591984912753, "9": 0.0008989800699055195, "10": 0.5581966042518616, "11": 0.0007827738299965858, "12": 0.0007352526881732047, "13": 0.03962354734539986, "14": 0.0006556459702551365, "15": 0.0006219749921001494, "16": 0.0005915935034863651, "17": 0.000564041780307889, "18": 0.0005389421712607145, "19": 0.3075246810913086}}, {"key": "movshovitz2015kb", "year": "2015", "title": "KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts", "topic_distr": {"0": 0.002771335421130061, "1": 0.3035615384578705, "2": 0.0019130654400214553, "3": 0.001656893058679998, "4": 0.001461314968764782, "5": 0.16917210817337036, "6": 0.2051849365234375, "7": 0.0010791743407025933, "8": 0.3055990934371948, "9": 0.0009189659613184631, "10": 0.0008554669329896569, "11": 0.0008001762325875461, "12": 0.0007515985635109246, "13": 0.0007085815304890275, "14": 0.0006702220998704433, "15": 0.0006358025129884481, "16": 0.000604745582677424, "17": 0.0005765813984908164, "18": 0.0005509238108061254, "19": 0.0005274523864500225}}, {"key": "muennighoff2023octopack", "year": "2023", "title": "OctoPack: Instruction Tuning Code Large Language Models", "topic_distr": {"0": 0.8831421136856079, "1": 0.0019212455954402685, "2": 0.001624189200811088, "3": 0.0014067752053961158, "4": 0.0012407245812937617, "5": 0.001109735807403922, "6": 0.0010037653846666217, "7": 0.0009162693750113249, "8": 0.08031380921602249, "9": 0.0007802449981682003, "10": 0.0007263313746079803, "11": 0.0006793869542889297, "12": 0.0006381422863341868, "13": 0.0006016188417561352, "14": 0.0005690498510375619, "15": 0.021407999098300934, "16": 0.0005134572274982929, "17": 0.000489544530864805, "18": 0.00046776002272963524, "19": 0.0004478316695895046}}, {"key": "mukherjee2020searching", "year": "2020", "title": "Searching a Database of Source Codes Using Contextualized Code Search", "topic_distr": {"0": 0.0017821919173002243, "1": 0.673105776309967, "2": 0.001229761866852641, "3": 0.0010651459451764822, "4": 0.0009394132066518068, "5": 0.0008402357343584299, "6": 0.12367432564496994, "7": 0.0006937526632100344, "8": 0.0006381284911185503, "9": 0.0005907619488425553, "10": 0.0005499412654899061, "11": 0.0005143973394297063, "12": 0.000483168987557292, "13": 0.00045551525545306504, "14": 0.0004308556963223964, "15": 0.000408728898037225, "16": 0.0003887637867592275, "17": 0.0003706582938320935, "18": 0.19149935245513916, "19": 0.00033907542820088565}}, {"key": "mukherjee2021neural", "year": "2021", "title": "Neural Program Generation Modulo Static Analysis", "topic_distr": {"0": 0.0023089719470590353, "1": 0.0018858994590118527, "2": 0.3361108601093292, "3": 0.0013807346113026142, "4": 0.0012177462922409177, "5": 0.0010891822166740894, "6": 0.0009851740906015038, "7": 0.0008992986404336989, "8": 0.000827194016892463, "9": 0.13143274188041687, "10": 0.0007128785946406424, "11": 0.0006668036803603172, "12": 0.09319356828927994, "13": 0.0005904759163968265, "14": 0.2262359857559204, "15": 0.0005298276082612574, "16": 0.0005039472016505897, "17": 0.00048047740710899234, "18": 0.19850873947143555, "19": 0.0004395371361169964}}, {"key": "murali2017bayesian", "year": "2018", "title": "Bayesian Sketch Learning for Program Synthesis", "topic_distr": {"0": 0.001327177626080811, "1": 0.0010835310677066445, "2": 0.000915770884603262, "3": 0.0007931876461952925, "4": 0.09610915929079056, "5": 0.0006257046479731798, "6": 0.0005659550661221147, "7": 0.0005166219780221581, "8": 0.000475199893116951, "9": 0.1270124465227127, "10": 0.00040952887502498925, "11": 0.0003830601053778082, "12": 0.00035980503889732063, "13": 0.348248153924942, "14": 0.0003208485140930861, "15": 0.0003043712058570236, "16": 0.000289503630483523, "17": 0.00027602087357081473, "18": 0.419731467962265, "19": 0.00025250183534808457}}, {"key": "murali2017finding", "year": "2017", "title": "Finding Likely Errors with Bayesian Specifications", "topic_distr": {"0": 0.0017314538126811385, "1": 0.18039822578430176, "2": 0.0011955968802794814, "3": 0.0010355537524446845, "4": 0.0009133148123510182, "5": 0.0008168927161023021, "6": 0.000738886184990406, "7": 0.08409754931926727, "8": 0.0006204001838341355, "9": 0.0005743495421484113, "10": 0.0005346629768610001, "11": 0.0005001065437681973, "12": 0.00046974571887403727, "13": 0.32845088839530945, "14": 0.0004188857856206596, "15": 0.00039737371844239533, "16": 0.00037796326796524227, "17": 0.00036036077653989196, "18": 0.39603814482688904, "19": 0.0003296553622931242}}, {"key": "nadeem2022codedsi", "year": "2022", "title": "CodeDSI: Differentiable Code Search", "topic_distr": {"0": 0.4316382110118866, "1": 0.4006223976612091, "2": 0.0012298040091991425, "3": 0.001065141404978931, "4": 0.0009394062799401581, "5": 0.0008402290986850858, "6": 0.0007599940872751176, "7": 0.0006937470170669258, "8": 0.0006381232524290681, "9": 0.0005907571176066995, "10": 0.0005499367835000157, "11": 0.0005143931484781206, "12": 0.00048316502943634987, "13": 0.0004555115301627666, "14": 0.00043085217475891113, "15": 0.0004087255510967225, "16": 0.00038876061444170773, "17": 0.000370655267033726, "18": 0.00035416128230281174, "19": 0.15702608227729797}}, {"key": "naik2022probing", "year": "2022", "title": "Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis", "topic_distr": {"0": 0.16836540400981903, "1": 0.001755943987518549, "2": 0.16993951797485352, "3": 0.25138676166534424, "4": 0.16409799456596375, "5": 0.0010140809463337064, "6": 0.0009172442951239645, "7": 0.0008372901356779039, "8": 0.23578877747058868, "9": 0.0007129905861802399, "10": 0.0006637241458520293, "11": 0.0006208261474967003, "12": 0.0005831366288475692, "13": 0.0005497613456100225, "14": 0.0005199997103773057, "15": 0.0004932949086651206, "16": 0.000469199032522738, "17": 0.0004473474982660264, "18": 0.00042744074016809464, "19": 0.00040923015330918133}}, {"key": "nair2020funcgnn", "year": "2020", "title": "funcGNN: A Graph Neural Network Approach to Program Similarity", "topic_distr": {"0": 0.0010062577202916145, "1": 0.02870800904929638, "2": 0.0006942510954104364, "3": 0.1868625283241272, "4": 0.0005303177167661488, "5": 0.00047432954306714237, "6": 0.0004290349897928536, "7": 0.05300170183181763, "8": 0.0003602359793148935, "9": 0.0003334966313559562, "10": 0.0003104525967501104, "11": 0.0002903873391915113, "12": 0.7254477739334106, "13": 0.0002571472432464361, "14": 0.00024322644458152354, "15": 0.00023073544434737414, "16": 0.00021946475317236036, "17": 0.00020924383716192096, "18": 0.00019993259047623724, "19": 0.00019141469965688884}}, {"key": "nguyen2013lexical", "year": "2013", "title": "Lexical Statistical Machine Translation for Language Migration", "topic_distr": {"0": 0.0018110284581780434, "1": 0.0014757851604372263, "2": 0.0012475989060476422, "3": 0.0010805870406329632, "4": 0.0009530331590212882, "5": 0.0008524176082573831, "6": 0.000771018851082772, "7": 0.5417293310165405, "8": 0.0006473801331594586, "9": 0.0005993268569000065, "10": 0.0005579143762588501, "11": 0.0005218551377765834, "12": 0.0004901740467175841, "13": 0.0004621193802449852, "14": 0.4449120759963989, "15": 0.0004146546998526901, "16": 0.0003944001509808004, "17": 0.0003760321415029466, "18": 0.00035929889418184757, "19": 0.000343991385307163}}, {"key": "nguyen2013statistical", "year": "2013", "title": "A Statistical Semantic Language Model for Source Code", "topic_distr": {"0": 0.0026578253600746393, "1": 0.0021665263921022415, "2": 0.08867689222097397, "3": 0.238158717751503, "4": 0.001399118802510202, "5": 0.0012514094123616815, "6": 0.001131910365074873, "7": 0.0010332440724596381, "8": 0.0009503999608568847, "9": 0.0008798543130978942, "10": 0.35748618841171265, "11": 0.0007661203271709383, "12": 0.0007196101942099631, "13": 0.299308180809021, "14": 0.0006416971446014941, "15": 0.0006087424699217081, "16": 0.0005790073773823678, "17": 0.0005520418053492904, "18": 0.000527476251590997, "19": 0.00050500372890383}}, {"key": "nguyen2013study", "year": "2013", "title": "A Study of Repetitiveness of Code Changes in Software Evolution", "topic_distr": {"0": 0.0020795471500605345, "1": 0.0016972218872979283, "2": 0.0014347119722515345, "3": 0.0012426533503457904, "4": 0.0010959742357954383, "5": 0.0009802691638469696, "6": 0.4953414499759674, "7": 0.0008093731012195349, "8": 0.0007444786024279892, "9": 0.0006892179953865707, "10": 0.0006415941752493382, "11": 0.3433268070220947, "12": 0.0005636936402879655, "13": 0.0005314311711117625, "14": 0.0005026618600822985, "15": 0.1466241329908371, "16": 0.0004535549378488213, "17": 0.00043243198888376355, "18": 0.0004131889727432281, "19": 0.00039558554999530315}}, {"key": "nguyen2014statistical", "year": "2014", "title": "Statistical Learning Approach for Mining API Usage Mappings for Code Migration", "topic_distr": {"0": 0.001559289637953043, "1": 0.0012732177274301648, "2": 0.0010760785080492496, "3": 0.0009320085518993437, "4": 0.0008219865267165005, "5": 0.0007352065877057612, "6": 0.0006650005816482008, "7": 0.7090539932250977, "8": 0.0005583626334555447, "9": 0.0005169169162400067, "10": 0.0004811988037545234, "11": 0.0004500978684518486, "12": 0.0004227730387356132, "13": 0.27944818139076233, "14": 0.00037699888343922794, "15": 0.0003576379385776818, "16": 0.00034016845165751874, "17": 0.00032432613079436123, "18": 0.00030989377410151064, "19": 0.0002966911415569484}}, {"key": "nguyen2015divide", "year": "2014", "title": "Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code", "topic_distr": {"0": 0.001485765795223415, "1": 0.0012123576598241925, "2": 0.12938104569911957, "3": 0.0008876288775354624, "4": 0.0007828508969396353, "5": 0.0007002021884545684, "6": 0.0006333386991173029, "7": 0.2969651520252228, "8": 0.3756234347820282, "9": 0.031662873923778534, "10": 0.00045828812289983034, "11": 0.00042866793228313327, "12": 0.0004026440728921443, "13": 0.08754712343215942, "14": 0.0003590493288356811, "15": 0.00034061018959619105, "16": 0.000323972461046651, "17": 0.00030888442415744066, "18": 0.07021350413560867, "19": 0.00028256516088731587}}, {"key": "nguyen2015graph", "year": "2015", "title": "Graph-based Statistical Language Model for Code", "topic_distr": {"0": 0.001861132332123816, "1": 0.0015197917819023132, "2": 0.0012848300393670797, "3": 0.07349931448698044, "4": 0.0009814659133553505, "5": 0.0008778487099334598, "6": 0.0007940214709378779, "7": 0.0007248083129525185, "8": 0.0006666941335424781, "9": 0.0006172072025947273, "10": 0.2778954803943634, "11": 0.0005374241736717522, "12": 0.08644607663154602, "13": 0.5498990416526794, "14": 0.0004501428047660738, "15": 0.0004270255158189684, "16": 0.00040616668411530554, "17": 0.0003872507077176124, "18": 0.00037001821328885853, "19": 0.00035425403621047735}}, {"key": "nguyen2016learning", "year": "2016", "title": "Learning API Usages from Bytecode: A Statistical Approach", "topic_distr": {"0": 0.0020114677026867867, "1": 0.0016427108785137534, "2": 0.0013884498039260507, "3": 0.0012025954201817513, "4": 0.0010606340365484357, "5": 0.07203122228384018, "6": 0.0008580697467550635, "7": 0.38524511456489563, "8": 0.0007204717840068042, "9": 0.0006669931462965906, "10": 0.030468199402093887, "11": 0.0005807745619677007, "12": 0.000545516493730247, "13": 0.37597495317459106, "14": 0.00048645277274772525, "15": 0.00046147077227942646, "16": 0.0004389293899293989, "17": 0.00041848758701235056, "18": 0.00039986506453715265, "19": 0.12339763343334198}}, {"key": "nguyen2016mapping", "year": "2016", "title": "Mapping API Elements for Code Migration with Vector Representations", "topic_distr": {"0": 0.0012595219304785132, "1": 0.0010285411262884736, "2": 0.0008695161086507142, "3": 0.0007531192968599498, "4": 0.0006642175139859319, "5": 0.0005940941045992076, "6": 0.000537363113835454, "7": 0.9903877377510071, "8": 0.0004511928709689528, "9": 0.00041770204552449286, "10": 0.0003888395440299064, "11": 0.00036370797897689044, "12": 0.00034162774682044983, "13": 0.0003220750077161938, "14": 0.00030463931034319103, "15": 0.00028899440076202154, "16": 0.0002748779661487788, "17": 0.0002620763552840799, "18": 0.0002504140720702708, "19": 0.0002397454809397459}}, {"key": "nguyen2017exploring", "year": "2017", "title": "Exploring API Embedding for API Usages and Applications", "topic_distr": {"0": 0.001123348018154502, "1": 0.0009176117018796504, "2": 0.0007755041588097811, "3": 0.0006717235082760453, "4": 0.0005924119614064693, "5": 0.0005298690521158278, "6": 0.00047927096602506936, "7": 0.9914265871047974, "8": 0.000402416248107329, "9": 0.0003725459682755172, "10": 0.0003468036593403667, "11": 0.0003243889659643173, "12": 0.00030469574267044663, "13": 0.0002872567856684327, "14": 0.00027170596877112985, "15": 0.000257752399193123, "16": 0.00024516202392987907, "17": 0.00023374432930722833, "18": 0.00022334280947688967, "19": 0.0002138275740435347}}, {"key": "nguyen2019graph", "year": "2019", "title": "Graph-based Mining of In-the-Wild, Fine-grained, Semantic Code Change Patterns", "topic_distr": {"0": 0.0013853858690708876, "1": 0.001131646684370935, "2": 0.000956461881287396, "3": 0.0008284291252493858, "4": 0.0007306385086849332, "5": 0.000653502473141998, "6": 0.000591098505537957, "7": 0.0005395736661739647, "8": 0.9893831014633179, "9": 0.0004594715137500316, "10": 0.0004277228144928813, "11": 0.0004000781336799264, "12": 0.0003757899103220552, "13": 0.00035428194678388536, "14": 0.0003351026971358806, "15": 0.00031789334025233984, "16": 0.0003023652534466237, "17": 0.00028828351059928536, "18": 0.0002754550368990749, "19": 0.000263719615759328}}, {"key": "nguyen2020suggesting", "year": "2020", "title": "Suggesting Natural Method Names to Check Name Consistencies", "topic_distr": {"0": 0.0010064969537779689, "1": 0.0008212410029955208, "2": 0.0006942388135939837, "3": 0.0006012957892380655, "4": 0.0005303177167661488, "5": 0.06886360049247742, "6": 0.00042903539724648, "7": 0.0003916372952517122, "8": 0.07174935191869736, "9": 0.12621402740478516, "10": 0.0003104528586845845, "11": 0.0002903876011259854, "12": 0.00027275856700725853, "13": 0.6424854397773743, "14": 0.0002432266774121672, "15": 0.00023073564807418734, "16": 0.00021946495689917356, "17": 0.00020924402633681893, "18": 0.00019993276509921998, "19": 0.08423712849617004}}, {"key": "nie2021evaluation", "year": "2021", "title": "Impact of Evaluation Methodologies on Code Summarization", "topic_distr": {"0": 0.32918089628219604, "1": 0.0016164706321433187, "2": 0.13921460509300232, "3": 0.0011834782781079412, "4": 0.0010437805904075503, "5": 0.0009335840004496276, "6": 0.000844434427563101, "7": 0.0007708269404247403, "8": 0.0007090230355970562, "9": 0.0006563941715285182, "10": 0.0006110384711064398, "11": 0.0005715456791222095, "12": 0.0005368479178287089, "13": 0.02417650818824768, "14": 0.00047872273717075586, "15": 0.0004541377129498869, "16": 0.00043195454054512084, "17": 0.00041183753637596965, "18": 0.00039351097075268626, "19": 0.49578043818473816}}, {"key": "nijkamp2022conversational", "year": "2022", "title": "A Conversational Paradigm for Program Synthesis", "topic_distr": {"0": 0.1958245486021042, "1": 0.001289086532779038, "2": 0.001089679659344256, "3": 0.0009437915286980569, "4": 0.0008323914953507483, "5": 0.0007445118972100317, "6": 0.020250890403985977, "7": 0.029512418434023857, "8": 0.0005654295673593879, "9": 0.0005234592244960368, "10": 0.00048728910041972995, "11": 0.0004557945067062974, "12": 0.00042812383617274463, "13": 0.0004036205937154591, "14": 0.00038177036913111806, "15": 0.0003621643700171262, "16": 0.00034447378129698336, "17": 0.0003284309641458094, "18": 0.5953028798103333, "19": 0.14992927014827728}}, {"key": "nijkamp2023codegen2", "year": "2023", "title": "CodeGen2: Lessons for Training LLMs on Programming and Natural Languages", "topic_distr": {"0": 0.7747047543525696, "1": 0.08633272349834442, "2": 0.0013450824189931154, "3": 0.0011650086380541325, "4": 0.0010274738306179643, "5": 0.0009189985576085746, "6": 0.0008312418940477073, "7": 0.000758784357458353, "8": 0.0006979460013099015, "9": 0.0006461393204517663, "10": 0.0006014921818859875, "11": 0.0005626164493151009, "12": 0.0005284607177600265, "13": 0.0004982147947885096, "14": 0.0004712436639238149, "15": 0.00044704272295348346, "16": 0.00042520611896179616, "17": 0.00040540340705774724, "18": 0.12726131081581116, "19": 0.0003708600124809891}}, {"key": "nitin2021direct", "year": "2021", "title": "DIRECT : A Transformer-based Model for Decompiled Identifier Renaming", "topic_distr": {"0": 0.5535016059875488, "1": 0.0021666185930371284, "2": 0.0018316124333068728, "3": 0.0015864076558500528, "4": 0.0013991433661431074, "5": 0.001251429202966392, "6": 0.0011319281766191125, "7": 0.0010332603706046939, "8": 0.0009504149202257395, "9": 0.0008798681665211916, "10": 0.0008190707885660231, "11": 0.0007661323761567473, "12": 0.0007196215447038412, "13": 0.2626896798610687, "14": 0.16650094091892242, "15": 0.0006087520741857588, "16": 0.0005790164577774704, "17": 0.0005520505364984274, "18": 0.0005274845170788467, "19": 0.0005050117033533752}}, {"key": "niu2022spt-code", "year": "2022", "title": "SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations", "topic_distr": {"0": 0.1698514074087143, "1": 0.0013950689462944865, "2": 0.0011792732402682304, "3": 0.16118507087230682, "4": 0.45891642570495605, "5": 0.000805718416813761, "6": 0.0007287790649570525, "7": 0.0006652529700659215, "8": 0.0006119137979112566, "9": 0.0005664931377395988, "10": 0.0005273494170978665, "11": 0.0004932656302116811, "12": 0.0004633201169781387, "13": 0.00043680245289579034, "14": 0.00041315591079182923, "15": 0.00039193808333948255, "16": 0.0003727931762114167, "17": 0.00035543146077543497, "18": 0.0003396149259060621, "19": 0.2003008872270584}}, {"key": "nye2021program", "year": "2021", "title": "Program Synthesis with Large Language Models", "topic_distr": {"0": 0.3778062164783478, "1": 0.001039203256368637, "2": 0.0008784011006355286, "3": 0.0007608169689774513, "4": 0.12918996810913086, "5": 0.0006001702859066427, "6": 0.0005428590229712427, "7": 0.0004955391632393003, "8": 0.0004558074870146811, "9": 0.15237031877040863, "10": 0.0003928164078388363, "11": 0.00036742782685905695, "12": 0.00034512177808210254, "13": 0.00032536903745494783, "14": 0.00030775502091273665, "15": 0.0002919501275755465, "16": 0.00027768927975557745, "17": 0.0002647567307576537, "18": 0.19144555926322937, "19": 0.14184223115444183}}, {"key": "nye2021show", "year": "2021", "title": "Show Your Work: Scratchpads for Intermediate Computation with Language Models", "topic_distr": {"0": 0.002970197005197406, "1": 0.002424320438876748, "2": 0.0020495483186095953, "3": 0.0017751922132447362, "4": 0.0015656574396416545, "5": 0.00140036316588521, "6": 0.001266640261746943, "7": 0.001156229991465807, "8": 0.0010635248618200421, "9": 0.0009845823515206575, "10": 0.0009165493538603187, "11": 0.0008573107188567519, "12": 0.0008052645134739578, "13": 0.0007591759786009789, "14": 0.0007180775864981115, "15": 0.0006812003557570279, "16": 0.9768330454826355, "17": 0.0006177506875246763, "18": 0.0005902610719203949, "19": 0.0005651137325912714}}, {"key": "oda2015learning", "year": "2015", "title": "Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation", "topic_distr": {"0": 0.002352764131501317, "1": 0.0019214536296203732, "2": 0.0016242428682744503, "3": 0.0014068009331822395, "4": 0.0012407460017129779, "5": 0.0011097542010247707, "6": 0.001003782032057643, "7": 0.0009162845090031624, "8": 0.0008428180008195341, "9": 0.000780257920268923, "10": 0.0007263433653861284, "11": 0.000679398188367486, "12": 0.0006381528219208121, "13": 0.0006016287952661514, "14": 0.38741418719291687, "15": 0.0005398349603638053, "16": 0.0005134657840244472, "17": 0.5157145857810974, "18": 0.00046776776434853673, "19": 0.07950571924448013}}, {"key": "oh2015learning", "year": "2015", "title": "Learning a Strategy for Adapting a Program Analysis via Bayesian Optimisation", "topic_distr": {"0": 0.06681329756975174, "1": 0.0013767164200544357, "2": 0.001163322594948113, "3": 0.0010075668105855584, "4": 0.0008886345312930644, "5": 0.4176485538482666, "6": 0.0007189189782366157, "7": 0.0006562523194588721, "8": 0.283108115196228, "9": 0.0005588287021964788, "10": 0.0005202145548537374, "11": 0.00048659194726496935, "12": 0.0004570515884552151, "13": 0.0004308926872909069, "14": 0.00040756608359515667, "15": 0.0003866353363264352, "16": 0.00036774942418560386, "17": 0.0003506226057652384, "18": 0.2223317176103592, "19": 0.0003207469417247921}}, {"key": "olausson2023demystifying", "year": "2023", "title": "Demystifying GPT Self-Repair for Code Generation", "topic_distr": {"0": 0.5331573486328125, "1": 0.0018182933563366532, "2": 0.001537216012366116, "3": 0.0013314178213477135, "4": 0.23126927018165588, "5": 0.0010502879740670323, "6": 0.0009499942534603179, "7": 0.0008671853574924171, "8": 0.0007976555498316884, "9": 0.0007384477648884058, "10": 0.0006874222308397293, "11": 0.0006429926143027842, "12": 0.0006039573927409947, "13": 0.0005693904822692275, "14": 0.0005385662079788744, "15": 0.0675869733095169, "16": 0.0004859516629949212, "17": 0.08499967306852341, "18": 0.06994415819644928, "19": 0.00042384161497466266}}, {"key": "omar2013structured", "year": "2013", "title": "Structured Statistical Syntax Tree Prediction", "topic_distr": {"0": 0.003041626187041402, "1": 0.0024835632648319006, "2": 0.002099522389471531, "3": 0.0018185105873271823, "4": 0.0016038608737289906, "5": 0.0014345347881317139, "6": 0.001297548646107316, "7": 0.0011844440596178174, "8": 0.32182714343070984, "9": 0.03521198779344559, "10": 0.19169577956199646, "11": 0.0008782306686043739, "12": 0.15983186662197113, "13": 0.2716778516769409, "14": 0.0007355999550782144, "15": 0.0006978228921070695, "16": 0.000663736485876143, "17": 0.0006328249583020806, "18": 0.0006046645576134324, "19": 0.0005789035349152982}}, {"key": "orlanski2021reading", "year": "2021", "title": "Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation", "topic_distr": {"0": 0.3502039909362793, "1": 0.002036559861153364, "2": 0.25201866030693054, "3": 0.0014911777107045054, "4": 0.0013151682214811444, "5": 0.0011763190850615501, "6": 0.13406428694725037, "7": 0.0009712448809295893, "8": 0.000893371703568846, "9": 0.0008270591497421265, "10": 0.0007699107518419623, "11": 0.0007201496628113091, "12": 0.000676430354360491, "13": 0.0006377155077643692, "14": 0.0006031924276612699, "15": 0.24956108629703522, "16": 0.0005442643305286765, "17": 0.0005189168732613325, "18": 0.0004958253121003509, "19": 0.00047470125718973577}}, {"key": "ott2018deep", "year": "2018", "title": "A Deep Learning Approach to Identifying Source Code in Images and Video", "topic_distr": {"0": 0.002774127060547471, "1": 0.2917691469192505, "2": 0.0019129804568365216, "3": 0.11299031972885132, "4": 0.001461312989704311, "5": 0.001307035330682993, "6": 0.29110220074653625, "7": 0.2880897521972656, "8": 0.0009926456259563565, "9": 0.0009189642150886357, "10": 0.0008554653613828123, "11": 0.0008001747191883624, "12": 0.0007515971665270627, "13": 0.0007085802499204874, "14": 0.0006702208775095642, "15": 0.0006358013488352299, "16": 0.0006047444767318666, "17": 0.0005765803507529199, "18": 0.000550922763068229, "19": 0.0005274513969197869}}, {"key": "pandi2020opttyper", "year": "2020", "title": "OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints", "topic_distr": {"0": 0.0014018386136740446, "1": 0.0011440556263551116, "2": 0.0009672293672338128, "3": 0.0008377330959774554, "4": 0.0007388480007648468, "5": 0.0006608450203202665, "6": 0.0005977398832328618, "7": 0.0005456361686810851, "8": 0.0005018877563998103, "9": 0.9892259240150452, "10": 0.00043252858449704945, "11": 0.00040457327850162983, "12": 0.0003800121776293963, "13": 0.0003582625649869442, "14": 0.0003388678014744073, "15": 0.0003214651078451425, "16": 0.0003057625435758382, "17": 0.0002915225923061371, "18": 0.00027854996733367443, "19": 0.00026668267673812807}}, {"key": "panthaplackel2020associating", "year": "2020", "title": "Associating Natural Language Comment and Source Code Entities", "topic_distr": {"0": 0.0027122609317302704, "1": 0.0022137188352644444, "2": 0.0018713785102590919, "3": 0.0016208529705181718, "4": 0.001429527415893972, "5": 0.0012786051956936717, "6": 0.2924180030822754, "7": 0.0010556987253949046, "8": 0.0009710540762171149, "9": 0.0008989753550849855, "10": 0.0008368577109649777, "11": 0.000782769697252661, "12": 0.0007352488464675844, "13": 0.0006931675598025322, "14": 0.0006556425360031426, "15": 0.0006219717324711382, "16": 0.0005915903602726758, "17": 0.0005640388699248433, "18": 0.12421276420354843, "19": 0.563835859298706}}, {"key": "panthaplackel2020copy", "year": "2020", "title": "Copy that! Editing Sequences by Copying Spans", "topic_distr": {"0": 0.0032825605012476444, "1": 0.22368736565113068, "2": 0.0022653492633253336, "3": 0.0019620773382484913, "4": 0.11978323012590408, "5": 0.0015477855922654271, "6": 0.11436066031455994, "7": 0.0012779512908309698, "8": 0.23380236327648163, "9": 0.08506682515144348, "10": 0.0010130384471267462, "11": 0.0009475635015405715, "12": 0.0008900382090359926, "13": 0.0008390977163799107, "14": 0.0007936726906336844, "15": 0.0007529132999479771, "16": 0.0007161358371376991, "17": 0.0006827840115875006, "18": 0.0006524004274979234, "19": 0.2056761533021927}}, {"key": "panthaplackel2020deep", "year": "2020", "title": "Deep Just-In-Time Inconsistency Detection Between Comments and Source Code", "topic_distr": {"0": 0.0020450465381145477, "1": 0.0016693805810064077, "2": 0.0014112620847299695, "3": 0.0012223201338201761, "4": 0.001078029745258391, "5": 0.15440873801708221, "6": 0.0008721432532183826, "7": 0.0007961204391904175, "8": 0.0007322885212488472, "9": 0.0006779326940886676, "10": 0.49772223830223083, "11": 0.08883847296237946, "12": 0.0005544637097045779, "13": 0.0005227295332588255, "14": 0.05975695699453354, "15": 0.0004690395144280046, "16": 0.0004461284261196852, "17": 0.00042535134707577527, "18": 0.0004064234090037644, "19": 0.18594492971897125}}, {"key": "panthaplackel2020learning", "year": "2020", "title": "Learning to Update Natural Language Comments Based on Code Changes", "topic_distr": {"0": 0.002654894720762968, "1": 0.002166753401979804, "2": 0.0018316172063350677, "3": 0.03665255382657051, "4": 0.0013991175219416618, "5": 0.0012514061527326703, "6": 0.07358219474554062, "7": 0.001033241394907236, "8": 0.0009503974579274654, "9": 0.0008798520429991186, "10": 0.0008190557127818465, "11": 0.15527579188346863, "12": 0.0007196083315648139, "13": 0.0006784222205169499, "14": 0.0006416954565793276, "15": 0.1029374822974205, "16": 0.0005790058639831841, "17": 0.0005520404083654284, "18": 0.0005274748546071351, "19": 0.6148673892021179}}, {"key": "panthaplackel2021learning", "year": "2021", "title": "Learning to Describe Solutions for Bug Reports Based on Developer Discussions", "topic_distr": {"0": 0.0017823163652792573, "1": 0.0014547088649123907, "2": 0.0012297645444050431, "3": 0.0010651389602571726, "4": 0.0009394151857122779, "5": 0.23615901172161102, "6": 0.0007600013632327318, "7": 0.0006937536527402699, "8": 0.000638129364233464, "9": 0.0005907627637498081, "10": 0.12216295301914215, "11": 0.0005143980961292982, "12": 0.0004831696569453925, "13": 0.0004555158957373351, "14": 0.000430856307502836, "15": 0.00040872948011383414, "16": 0.0003887643397320062, "17": 0.0003706588177010417, "18": 0.0003541646874509752, "19": 0.6291177868843079}}, {"key": "panthaplackel2022using", "year": "2022", "title": "Using Developer Discussions to Guide Fixing Bugs in Software", "topic_distr": {"0": 0.002547771902754903, "1": 0.002078237012028694, "2": 0.001756821060553193, "3": 0.0015216267202049494, "4": 0.0013420216273516417, "5": 0.0012003370793536305, "6": 0.18777261674404144, "7": 0.000991075299680233, "8": 0.0009116121218539774, "9": 0.0008439455996267498, "10": 0.0007856303709559143, "11": 0.1974731832742691, "12": 0.0006902414024807513, "13": 0.0006507360958494246, "14": 0.0006155081209726632, "15": 0.2406657338142395, "16": 0.0005553768132813275, "17": 0.0005295118317008018, "18": 0.0005059487884864211, "19": 0.356562077999115}}, {"key": "parisi2021source", "year": "2021", "title": "Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers", "topic_distr": {"0": 0.0024954271502792835, "1": 0.28104713559150696, "2": 0.0017217242857441306, "3": 0.001491209608502686, "4": 0.0013151814928278327, "5": 0.26290038228034973, "6": 0.0010640027467161417, "7": 0.0009712558821775019, "8": 0.0008933818317018449, "9": 0.0008270685211755335, "10": 0.0007699194829910994, "11": 0.0007201578700914979, "12": 0.0006764380377717316, "13": 0.0006377227837219834, "14": 0.22639524936676025, "15": 0.17849133908748627, "16": 0.0005442705005407333, "17": 0.03606756031513214, "18": 0.0004958309582434595, "19": 0.0004747066705022007}}, {"key": "parisi2022making", "year": "2022", "title": "Making the Most of Scarce Input Data in Deep Learning-Based Source Code Classification for Heterogeneous Device Mapping", "topic_distr": {"0": 0.11886831372976303, "1": 0.22696246206760406, "2": 0.1368444859981537, "3": 0.0009320048848167062, "4": 0.0008219860028475523, "5": 0.16773676872253418, "6": 0.16601701080799103, "7": 0.0006070330273360014, "8": 0.0005583619349636137, "9": 0.0005169162177480757, "10": 0.00048119816347025335, "11": 0.00045009725727140903, "12": 0.0004227724566590041, "13": 0.0003985754738096148, "14": 0.0003769983886741102, "15": 0.000357637443812564, "16": 0.0003401680151000619, "17": 0.1767006516456604, "18": 0.00030989336664788425, "19": 0.00029669073410332203}}, {"key": "parvez2018building", "year": "2018", "title": "Building Language Models for Text with Named Entities", "topic_distr": {"0": 0.23787051439285278, "1": 0.0024248696863651276, "2": 0.33783426880836487, "3": 0.001775230746716261, "4": 0.14775340259075165, "5": 0.001400386798195541, "6": 0.09912656247615814, "7": 0.001156249432824552, "8": 0.0010635427897796035, "9": 0.08475503325462341, "10": 0.0009165647206827998, "11": 0.0008573251543566585, "12": 0.0008052780758589506, "13": 0.07844039797782898, "14": 0.0007180896354839206, "15": 0.0006812118226662278, "16": 0.0006479367730207741, "17": 0.0006177611066959798, "18": 0.0005902710254304111, "19": 0.0005651232204400003}}, {"key": "parvez2021retrieval", "year": "2021", "title": "Retrieval Augmented Code Generation and Summarization", "topic_distr": {"0": 0.002190434606745839, "1": 0.32397526502609253, "2": 0.0015102762263268232, "3": 0.0013080746866762638, "4": 0.0011536694364622235, "5": 0.001031871302984655, "6": 0.0009333357447758317, "7": 0.0008519788971170783, "8": 0.0007836683071218431, "9": 0.0007254987140186131, "10": 0.0006753680063411593, "11": 0.0006317174411378801, "12": 0.000593366741668433, "13": 0.0005594059475697577, "14": 0.0005291221896186471, "15": 0.0005019488744437695, "16": 0.00047743029426783323, "17": 0.00045519540435634553, "18": 0.00043493942939676344, "19": 0.660677433013916}}, {"key": "pashakhanloo2022codetrek", "year": "2022", "title": "CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation", "topic_distr": {"0": 0.002353803487494588, "1": 0.05509020760655403, "2": 0.0016241994453594089, "3": 0.19118911027908325, "4": 0.10875101387500763, "5": 0.0011097377864643931, "6": 0.001003767130896449, "7": 0.0009162709466181695, "8": 0.0008428054861724377, "9": 0.46950316429138184, "10": 0.0007263325969688594, "11": 0.000679388118442148, "12": 0.16258111596107483, "13": 0.0006016198312863708, "14": 0.0005690508405677974, "15": 0.0005398269859142601, "16": 0.0005134581588208675, "17": 0.0004895454039797187, "18": 0.00046776083763688803, "19": 0.00044783245539292693}}, {"key": "patil2022exploring", "year": "2022", "title": "Exploring Dimensions of Generalizability and Few-shot Transfer for Text-to-SQL Semantic Parsing", "topic_distr": {"0": 0.5296627879142761, "1": 0.031336743384599686, "2": 0.001871358253993094, "3": 0.0016208314336836338, "4": 0.0014295057626441121, "5": 0.0012785873841494322, "6": 0.0011564932065084577, "7": 0.0010556841734796762, "8": 0.42315295338630676, "9": 0.0008989630732685328, "10": 0.0008368462440557778, "11": 0.0007827589870430529, "12": 0.0007352387765422463, "13": 0.0006931580719538033, "14": 0.0006556335720233619, "15": 0.0006219632341526449, "16": 0.0005915822694078088, "17": 0.0005640311283059418, "18": 0.0005389319849200547, "19": 0.0005159714492037892}}, {"key": "patra2016learning", "year": "2016", "title": "Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic, Generative Models of Input Data", "topic_distr": {"0": 0.0013268737820908427, "1": 0.0010833840351551771, "2": 0.0009158079046756029, "3": 0.000793196726590395, "4": 0.12365182489156723, "5": 0.15926063060760498, "6": 0.36407992243766785, "7": 0.000516625470481813, "8": 0.00047520309453830123, "9": 0.12988409399986267, "10": 0.0004095316107850522, "11": 0.00038306269561871886, "12": 0.00035980745451524854, "13": 0.0003392142243683338, "14": 0.0003208506677765399, "15": 0.00030437324312515557, "16": 0.0002895055804401636, "17": 0.2150897979736328, "18": 0.00026373984292149544, "19": 0.00025250352337025106}}, {"key": "patra2021semantic", "year": "2021", "title": "A Semantic Bug Seeding: A Learning-Based Approach for Creating Realistic Bugs", "topic_distr": {"0": 0.0012230046559125185, "1": 0.0009983992204070091, "2": 0.0008439468219876289, "3": 0.05347505211830139, "4": 0.0006446921615861356, "5": 0.5344248414039612, "6": 0.0005215664277784526, "7": 0.0004761026066262275, "8": 0.1814476102590561, "9": 0.0004054229939356446, "10": 0.07493152469396591, "11": 0.0003530161629896611, "12": 0.000331585033563897, "13": 0.00031260709511116147, "14": 0.0002956839161925018, "15": 0.14831803739070892, "16": 0.0002667974622454494, "17": 0.0002543721639085561, "18": 0.00024305273836944252, "19": 0.0002326977701159194}}, {"key": "pearce2021empirical", "year": "2021", "title": "An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions", "topic_distr": {"0": 0.20914937555789948, "1": 0.002314523095265031, "2": 0.001956396037712693, "3": 0.0016945265233516693, "4": 0.001494507770985365, "5": 0.16057346761226654, "6": 0.0012090789387002587, "7": 0.0011036861687898636, "8": 0.0010151941096410155, "9": 0.0009398389374837279, "10": 0.0008748976397328079, "11": 0.320639044046402, "12": 0.0007686701137572527, "13": 0.0007246760069392622, "14": 0.0006854452658444643, "15": 0.0006502438918687403, "16": 0.0006184815429151058, "17": 0.0005896776565350592, "18": 0.29245883226394653, "19": 0.0005394326872192323}}, {"key": "peng2021how", "year": "2021", "title": "How could Neural Networks understand Programs?", "topic_distr": {"0": 0.001732712029479444, "1": 0.1726647913455963, "2": 0.001195593853481114, "3": 0.12979958951473236, "4": 0.06605451554059982, "5": 0.0008168928907252848, "6": 0.0007388863596133888, "7": 0.0006744792335666716, "8": 0.0006204003584571183, "9": 0.040793612599372864, "10": 0.0005346630932763219, "11": 0.0005001066019758582, "12": 0.35290905833244324, "13": 0.00044286035699769855, "14": 0.00041888587293215096, "15": 0.0003973738057538867, "16": 0.00037796335527673364, "17": 0.00036036086385138333, "18": 0.22863762080669403, "19": 0.0003296554205007851}}, {"key": "peng2023generative", "year": "2023", "title": "Generative Type Inference for Python", "topic_distr": {"0": 0.0011779435444623232, "1": 0.0009607580723240972, "2": 0.0008121157297864556, "3": 0.0007033958099782467, "4": 0.0006203679949976504, "5": 0.0005548730841837823, "6": 0.0005018873489461839, "7": 0.0004581389366649091, "8": 0.0004214059445075691, "9": 0.9909526109695435, "10": 0.0003631690633483231, "11": 0.0003396966203581542, "12": 0.0003190741117578, "13": 0.00030081221484579146, "14": 0.0002845275739673525, "15": 0.00026991553022526205, "16": 0.0002567310002632439, "17": 0.0002447745355311781, "18": 0.00023388217960018665, "19": 0.00022391791571862996}}, {"key": "phan2021cotext", "year": "2021", "title": "CoTexT: Multi-task Learning with Code-Text Transformer", "topic_distr": {"0": 0.0027727934066206217, "1": 0.0022628046572208405, "2": 0.001912882667966187, "3": 0.0016568565042689443, "4": 0.9792333841323853, "5": 0.0013070020359009504, "6": 0.0011821931693702936, "7": 0.0010791439563035965, "8": 0.0009926195489242673, "9": 0.0009189400589093566, "10": 0.0008554428350180387, "11": 0.0008001537062227726, "12": 0.000751577434130013, "13": 0.0007085616234689951, "14": 0.0006702032405883074, "15": 0.0006357846432365477, "16": 0.0006047285860404372, "17": 0.0005765651585534215, "18": 0.0005509082693606615, "19": 0.0005274375434964895}}, {"key": "piech2015learning", "year": "2015", "title": "Learning Program Embeddings to Propagate Feedback on Student Code", "topic_distr": {"0": 0.00415824493393302, "1": 0.3120197355747223, "2": 0.0028695324435830116, "3": 0.002485383301973343, "4": 0.0021919794380664825, "5": 0.001960562076419592, "6": 0.0017733448185026646, "7": 0.23437754809856415, "8": 0.0014889755984768271, "9": 0.0013784529874101281, "10": 0.0012832041829824448, "11": 0.0012002678122371435, "12": 0.0011274011339992285, "13": 0.0010628754971548915, "14": 0.0010053360601887107, "15": 0.04961565136909485, "16": 0.17008090019226074, "17": 0.0008648746297694743, "18": 0.2082645148038864, "19": 0.0007911808788776398}}, {"key": "poesia2022synchromesh", "year": "2022", "title": "Synchromesh: Reliable code generation from pre-trained language models", "topic_distr": {"0": 0.0017323099309578538, "1": 0.10084947943687439, "2": 0.0757145956158638, "3": 0.09722290188074112, "4": 0.21723701059818268, "5": 0.0008168946951627731, "6": 0.0007388880476355553, "7": 0.04150799661874771, "8": 0.0006204017554409802, "9": 0.000574351055547595, "10": 0.0005346643156372011, "11": 0.0005001078243367374, "12": 0.000469746912131086, "13": 0.000442861404735595, "14": 0.0004188868624623865, "15": 0.0003973747370764613, "16": 0.00037796422839164734, "17": 0.0003603617078624666, "18": 0.31389984488487244, "19": 0.145583376288414}}, {"key": "popov2021time", "year": "2021", "title": "Time-Efficient Code Completion Model for the R Programming Language", "topic_distr": {"0": 0.6631543040275574, "1": 0.002545572817325592, "2": 0.002152049448341131, "3": 0.0018639673944562674, "4": 0.05139801278710365, "5": 0.0014703880297020078, "6": 0.0013299781130626798, "7": 0.0012140467297285795, "8": 0.001116706058382988, "9": 0.001033815904520452, "10": 0.26616692543029785, "11": 0.0009001801954582334, "12": 0.0008455314673483372, "13": 0.0007971382583491504, "14": 0.0007539847283624113, "15": 0.0007152635371312499, "16": 0.000680325145367533, "17": 0.0006486410857178271, "18": 0.0006197768379934132, "19": 0.0005933720385655761}}, {"key": "pradel2017deep", "year": "2017", "title": "Deep Learning to Find Bugs", "topic_distr": {"0": 0.0011775675229728222, "1": 0.0009606918320059776, "2": 0.0008121214341372252, "3": 0.0007033946458250284, "4": 0.0006203665398061275, "5": 0.6621314287185669, "6": 0.000501885951962322, "7": 0.00045813765609636903, "8": 0.00042140475125052035, "9": 0.3293764889240265, "10": 0.0003631680447142571, "11": 0.00033969568903557956, "12": 0.0003190732095390558, "13": 0.0003008113708347082, "14": 0.0002845267590600997, "15": 0.00026991477352567017, "16": 0.00025673030177131295, "17": 0.00024477383703924716, "18": 0.00023388152476400137, "19": 0.00022391728998627514}}, {"key": "pradel2019typewriter", "year": "2019", "title": "TypeWriter: Neural Type Prediction with Search-based Validation", "topic_distr": {"0": 0.0013270333874970675, "1": 0.0010838019661605358, "2": 0.0009157542954199016, "3": 0.0007931827567517757, "4": 0.0006995554431341588, "5": 0.0006257000495679677, "6": 0.0005659508751705289, "7": 0.0005166181363165379, "8": 0.0004751963715534657, "9": 0.9897986650466919, "10": 0.00040952584822662175, "11": 0.0003830572823062539, "12": 0.00035980239044874907, "13": 0.00033920942223630846, "14": 0.00032084615668281913, "15": 0.00030436893575824797, "16": 0.0002895014768000692, "17": 0.00027601883630268276, "18": 0.000263736117631197, "19": 0.00025249997270293534}}, {"key": "pradel2020neural", "year": "2020", "title": "Neural Software Analysis", "topic_distr": {"0": 0.001620053662918508, "1": 0.20882710814476013, "2": 0.0011179593857377768, "3": 0.0009683148236945271, "4": 0.0008540163980796933, "5": 0.6245821118354797, "6": 0.0006909125950187445, "7": 0.0006306872237473726, "8": 0.0005801194929517806, "9": 0.05685010552406311, "10": 0.09987374395132065, "11": 0.0004676361277233809, "12": 0.0004392465634737164, "13": 0.00041410670382902026, "14": 0.0003916888381354511, "15": 0.00037157346378080547, "16": 0.00035342329647392035, "17": 0.00033696365426294506, "18": 0.000321968924254179, "19": 0.00030825185240246356}}, {"key": "pravilov2021unsupervised", "year": "2021", "title": "Unsupervised Learning of General-Purpose Embeddings for Code Changes", "topic_distr": {"0": 0.0023996445816010237, "1": 0.001958355540409684, "2": 0.19245611131191254, "3": 0.1402883380651474, "4": 0.18767258524894714, "5": 0.0011310771806165576, "6": 0.001023068674840033, "7": 0.0009338900563307106, "8": 0.0008590119541622698, "9": 0.0007952497689984739, "10": 0.0007402993505820632, "11": 0.3916054964065552, "12": 0.000650414323899895, "13": 0.0006131884874776006, "14": 0.0005799931823275983, "15": 0.0005502073909156024, "16": 0.0005233315168879926, "17": 0.0004989589215256274, "18": 0.00047675546375103295, "19": 0.07424403727054596}}, {"key": "proksch2015intelligent", "year": "2015", "title": "Intelligent Code Completion with Bayesian Networks", "topic_distr": {"0": 0.0014184555038809776, "1": 0.4171982407569885, "2": 0.0009782512206584215, "3": 0.0008472715853713453, "4": 0.0007472615106962621, "5": 0.0006683696410618722, "6": 0.0006045459886081517, "7": 0.029774049296975136, "8": 0.0005076024681329727, "9": 0.06605321168899536, "10": 0.23282268643379211, "11": 0.000409179920097813, "12": 0.0003843391314148903, "13": 0.00036234184517525136, "14": 0.00034272627090103924, "15": 0.0003251254092901945, "16": 0.24571006000041962, "17": 0.00029484197148121893, "18": 0.00028172164456918836, "19": 0.0002697192248888314}}, {"key": "pu2016skp", "year": "2016", "title": "sk_p: a neural program corrector for MOOCs", "topic_distr": {"0": 0.0025451150722801685, "1": 0.002078576013445854, "2": 0.0017568451585248113, "3": 0.0015216282336041331, "4": 0.0013420111499726772, "5": 0.0012003284646198153, "6": 0.0010857071029022336, "7": 0.18828997015953064, "8": 0.217152401804924, "9": 0.0008439397206529975, "10": 0.0007856248994357884, "11": 0.000734848203137517, "12": 0.10997644811868668, "13": 0.0006507314974442124, "14": 0.11248625069856644, "15": 0.16038744151592255, "16": 0.0005553729715757072, "17": 0.0005295081064105034, "18": 0.19559288024902344, "19": 0.0004843900678679347}}, {"key": "puri2021project", "year": "2021", "title": "Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks", "topic_distr": {"0": 0.8101667165756226, "1": 0.14784401655197144, "2": 0.0010897023603320122, "3": 0.0009438074775971472, "4": 0.000832391669973731, "5": 0.0007445115479640663, "6": 0.0006734169437550008, "7": 0.0006147166131995618, "8": 0.000565429450944066, "9": 0.0005234591662883759, "10": 0.0004872889840044081, "11": 0.00045579441939480603, "12": 0.00042812374886125326, "13": 0.00040362050640396774, "14": 0.03257773444056511, "15": 0.00036216428270563483, "16": 0.00034447372308932245, "17": 0.0003284309059381485, "18": 0.000313815864501521, "19": 0.00030044614686630666}}, {"key": "rabin2019testing", "year": "2019", "title": "Testing Neural Program Analyzers", "topic_distr": {"0": 0.18774564564228058, "1": 0.32941359281539917, "2": 0.0017933635972440243, "3": 0.0015533139230683446, "4": 0.0013699628179892898, "5": 0.37617290019989014, "6": 0.001108321244828403, "7": 0.0010117113124579191, "8": 0.0009305935818701982, "9": 0.0008615181432105601, "10": 0.0008019887027330697, "11": 0.0007501543732360005, "12": 0.0007046135142445564, "13": 0.0006642856169492006, "14": 0.0006283241673372686, "15": 0.0005960562848486006, "16": 0.0005669408128596842, "17": 0.0005405372940003872, "18": 0.09229163825511932, "19": 0.0004944794345647097}}, {"key": "rabin2020demystifying", "year": "2020", "title": "Towards Demystifying Dimensions of Source Code Embeddings", "topic_distr": {"0": 0.13083767890930176, "1": 0.001616664812900126, "2": 0.0013665134320035577, "3": 0.6253684759140015, "4": 0.001043796306475997, "5": 0.0009335973882116377, "6": 0.0008444465347565711, "7": 0.000770837941672653, "8": 0.0007090331637300551, "9": 0.0006564035429619253, "10": 0.0006110471440479159, "11": 0.0005715538281947374, "12": 0.07996419072151184, "13": 0.000506129115819931, "14": 0.00047872954746708274, "15": 0.0004541441740002483, "16": 0.0004319606814533472, "17": 0.0004118434153497219, "18": 0.15204620361328125, "19": 0.00037675126804970205}}, {"key": "rabin2021generalizability", "year": "2021", "title": "On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations", "topic_distr": {"0": 0.08938826620578766, "1": 0.17066149413585663, "2": 0.1176242083311081, "3": 0.0006161977653391659, "4": 0.000543461472261697, "5": 0.0004860858025494963, "6": 0.00043966862722299993, "7": 0.00040134365553967655, "8": 0.10163405537605286, "9": 0.0003417623520363122, "10": 0.000318147154757753, "11": 0.0002975846000481397, "12": 0.2736622393131256, "13": 0.02830488421022892, "14": 0.00024925480829551816, "15": 0.00023645421606488526, "16": 0.00022490418632514775, "17": 0.00021442995057441294, "18": 0.21415942907333374, "19": 0.000196158915059641}}, {"key": "rabin2021understanding", "year": "2021", "title": "Understanding Neural Code Intelligence Through Program Simplification", "topic_distr": {"0": 0.001371242688037455, "1": 0.7268484830856323, "2": 0.0009459748980589211, "3": 0.06624115258455276, "4": 0.05490439012646675, "5": 0.000646330532617867, "6": 0.000584611261729151, "7": 0.000533651967998594, "8": 0.0004908644477836788, "9": 0.00045442889677360654, "10": 0.00042302862857468426, "11": 0.00039568732609041035, "12": 0.00037166569381952286, "13": 0.0003503937623463571, "14": 0.00033142499160021544, "15": 0.0003144045185763389, "16": 0.00029904686380177736, "17": 0.0002851196622941643, "18": 0.14394725859165192, "19": 0.0002608253271318972}}, {"key": "rabin2022memorization", "year": "2022", "title": "Memorization and Generalization in Neural Code Intelligence Models", "topic_distr": {"0": 0.7157166600227356, "1": 0.23109976947307587, "2": 0.0009157719323411584, "3": 0.0007931820582598448, "4": 0.0006995555595494807, "5": 0.0006257007480598986, "6": 0.0005659515154547989, "7": 0.0005166187766008079, "8": 0.045428317040205, "9": 0.0004399243334773928, "10": 0.00040952631388790905, "11": 0.00038305771886371076, "12": 0.00035980279790237546, "13": 0.00033920982968993485, "14": 0.0003208465059287846, "15": 0.00030436928500421345, "16": 0.0002895018260460347, "17": 0.0002760191564448178, "18": 0.00026373640866950154, "19": 0.0002525002637412399}}, {"key": "rabin2022understanding", "year": "2022", "title": "Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models", "topic_distr": {"0": 0.002047714777290821, "1": 0.6990546584129333, "2": 0.0014111710479483008, "3": 0.0012222804361954331, "4": 0.0010780052980408072, "5": 0.0009641940123401582, "6": 0.0008721215417608619, "7": 0.0007961005321703851, "8": 0.0007322702440433204, "9": 0.025052225217223167, "10": 0.0006310729659162462, "11": 0.0005902852863073349, "12": 0.0005544498562812805, "13": 0.00052271643653512, "14": 0.0004944188985973597, "15": 0.000469027814688161, "16": 0.0004461173084564507, "17": 0.00042534072417765856, "18": 0.2622467279434204, "19": 0.0003890985099133104}}, {"key": "rabinovich2017abstract", "year": "2017", "title": "Abstract Syntax Networks for Code Generation and Semantic Parsing", "topic_distr": {"0": 0.002773512387648225, "1": 0.0022629699669778347, "2": 0.0019129366846755147, "3": 0.0016568780411034822, "4": 0.0014613016974180937, "5": 0.0013070246204733849, "6": 0.0011822148226201534, "7": 0.001079163746908307, "8": 0.0009926378261297941, "9": 0.0009189569973386824, "10": 0.0008554586092941463, "11": 0.0008001683745533228, "12": 0.6540197730064392, "13": 0.0007085746619850397, "14": 0.000670215580612421, "15": 0.0006357963429763913, "16": 0.0006047397037036717, "17": 0.0005765758105553687, "18": 0.0005509184557013214, "19": 0.3250301480293274}}, {"key": "raghothaman2018user", "year": "2018", "title": "User-guided program reasoning using Bayesian inference", "topic_distr": {"0": 0.002547903684899211, "1": 0.0020780968479812145, "2": 0.001756832585670054, "3": 0.0015216318424791098, "4": 0.0013420244213193655, "5": 0.5159039497375488, "6": 0.001085719559341669, "7": 0.0009910797234624624, "8": 0.0009116161963902414, "9": 0.12092799693346024, "10": 0.00078563392162323, "11": 0.0007348566432483494, "12": 0.0006902444874867797, "13": 0.0006507390062324703, "14": 0.000615510914940387, "15": 0.0005839010700583458, "16": 0.0005553793744184077, "17": 0.0005295142182148993, "18": 0.3453029990196228, "19": 0.0004843956558033824}}, {"key": "rahman2019natural", "year": "2019", "title": "Natural Software Revisited", "topic_distr": {"0": 0.39751866459846497, "1": 0.0014975180383771658, "2": 0.0012659403728321195, "3": 0.001096487743780017, "4": 0.0009670580038800836, "5": 0.0008649618248455226, "6": 0.21879874169826508, "7": 0.18566511571407318, "8": 0.000656906864605844, "9": 0.0006081464234739542, "10": 0.0005661245086230338, "11": 0.0005295346491038799, "12": 0.1871362179517746, "13": 0.0004689198103733361, "14": 0.00044353457633405924, "15": 0.0004207566671539098, "16": 0.0004002040368504822, "17": 0.00038156574009917676, "18": 0.0003645862452685833, "19": 0.00034905350184999406}}, {"key": "ramakrishnan2020backdoors", "year": "2022", "title": "Backdoors in Neural Models of Source Code", "topic_distr": {"0": 0.3873576819896698, "1": 0.32097193598747253, "2": 0.002326501067727804, "3": 0.08122970163822174, "4": 0.19332338869571686, "5": 0.0015896084951236844, "6": 0.0014378136256709695, "7": 0.001312482520006597, "8": 0.0012072493555024266, "9": 0.0011176384286955, "10": 0.0010404115309938788, "11": 0.0009731674217619002, "12": 0.0009140877518802881, "13": 0.0008617708226665854, "14": 0.0008151183719746768, "15": 0.0007732575759291649, "16": 0.0007354863919317722, "17": 0.0007012333371676505, "18": 0.0006700287922285497, "19": 0.0006414830568246543}}, {"key": "ray2015naturalness", "year": "2015", "title": "On the \u201cNaturalness\u201d of Buggy Code", "topic_distr": {"0": 0.002311105839908123, "1": 0.0018859596457332373, "2": 0.0015941648744046688, "3": 0.0013807397335767746, "4": 0.0012177592143416405, "5": 0.2672394812107086, "6": 0.0009851857321336865, "7": 0.0008993092342279851, "8": 0.0008272037957794964, "9": 0.0007658026879653335, "10": 0.3055136501789093, "11": 0.0006668115383945405, "12": 0.0006263302639126778, "13": 0.14005285501480103, "14": 0.0005585167673416436, "15": 0.27159208059310913, "16": 0.0005039531970396638, "17": 0.0004804830823559314, "18": 0.0004591018077917397, "19": 0.0004395423165988177}}, {"key": "raychev2014code", "year": "2014", "title": "Code Completion with Statistical Language Models", "topic_distr": {"0": 0.002446671947836876, "1": 0.001996759558096528, "2": 0.0016879678005352616, "3": 0.001461949199438095, "4": 0.0012893882812932134, "5": 0.09700777381658554, "6": 0.001043133088387549, "7": 0.12193023413419724, "8": 0.000875858822837472, "9": 0.000810846162494272, "10": 0.41167452931404114, "11": 0.0007060325006023049, "12": 0.0006631702417507768, "13": 0.20328710973262787, "14": 0.0005913680070079863, "15": 0.000560997985303402, "16": 0.0005335950409062207, "17": 0.000508744444232434, "18": 0.1504584550857544, "19": 0.0004653956275433302}}, {"key": "raychev2015predicting", "year": "2015", "title": "Predicting Program Properties from \u201cBig Code\u201d", "topic_distr": {"0": 0.0012231053551658988, "1": 0.000998396542854607, "2": 0.0008439730736427009, "3": 0.0007309927605092525, "4": 0.0006446972256526351, "5": 0.0005766330868937075, "6": 0.0005215693963691592, "7": 0.00047610531328246, "8": 0.00043793179793283343, "9": 0.6900396943092346, "10": 0.0003774111100938171, "11": 0.0003530181711539626, "12": 0.0003315869253128767, "13": 0.10228311270475388, "14": 0.0002956856042146683, "15": 0.0002805005351547152, "16": 0.0002667989756446332, "17": 0.00025437361910007894, "18": 0.1988316923379898, "19": 0.0002326990943402052}}, {"key": "raychev2016learning", "year": "2016", "title": "Learning Programs from Noisy Data", "topic_distr": {"0": 0.22759012877941132, "1": 0.17069140076637268, "2": 0.0007114291656762362, "3": 0.000616198405623436, "4": 0.000543463509529829, "5": 0.0004860876069869846, "6": 0.0004396702570375055, "7": 0.0770912915468216, "8": 0.14992466568946838, "9": 0.0003417636326048523, "10": 0.05238761752843857, "11": 0.00029758570599369705, "12": 0.00027951967786066234, "13": 0.10506882518529892, "14": 0.0002492557396180928, "15": 0.0002364551037317142, "16": 0.00022490501578431576, "17": 0.0002144307509297505, "18": 0.2124091237783432, "19": 0.00019615964265540242}}, {"key": "reid2022learning", "year": "2022", "title": "Learning to Model Editing Processes", "topic_distr": {"0": 0.0020793720614165068, "1": 0.001697374740615487, "2": 0.0014347145333886147, "3": 0.001242659636773169, "4": 0.10932736843824387, "5": 0.0009802712593227625, "6": 0.0008866635616868734, "7": 0.0008093749638646841, "8": 0.0007444802904501557, "9": 0.0006892195669934154, "10": 0.000641595630440861, "11": 0.0006001279107294977, "12": 0.0005636949208565056, "13": 0.0005314323934726417, "14": 0.2557799816131592, "15": 0.0004768485086970031, "16": 0.0004535559855867177, "17": 0.1679094135761261, "18": 0.0004131899040658027, "19": 0.45273861289024353}}, {"key": "ren2020codebleu", "year": "2020", "title": "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis", "topic_distr": {"0": 0.2632967233657837, "1": 0.0014976030215620995, "2": 0.0012659579515457153, "3": 0.05405088886618614, "4": 0.0009670440922491252, "5": 0.000864949484821409, "6": 0.0007823540363460779, "7": 0.0007141579990275204, "8": 0.0006568977260030806, "9": 0.0006081379833631217, "10": 0.0005661166505888104, "11": 0.0005295272567309439, "12": 0.19190818071365356, "13": 0.00046891329111531377, "14": 0.05840080603957176, "15": 0.000420750817283988, "16": 0.000400198478018865, "17": 0.00038156044320203364, "18": 0.2940445840358734, "19": 0.12817469239234924}}, {"key": "richardson2017code2text", "year": "2017", "title": "The Code2Text Challenge: Text Generation in Source Code Libraries", "topic_distr": {"0": 0.002901797415688634, "1": 0.002368092304095626, "2": 0.0020020457450300455, "3": 0.0017339173937216401, "4": 0.1834055483341217, "5": 0.0013678012182936072, "6": 0.3737424314022064, "7": 0.0011293444549664855, "8": 0.0010387951042503119, "9": 0.0009616881143301725, "10": 0.0008952370844781399, "11": 0.0008373759337700903, "12": 0.0007865399238653481, "13": 0.000741523050237447, "14": 0.000701380311511457, "15": 0.0006653605960309505, "16": 0.0006328598828986287, "17": 0.0006033863173797727, "18": 0.0005765359383076429, "19": 0.42290830612182617}}, {"key": "richardson2017function", "year": "2017", "title": "Function Assistant: A Tool for NL Querying of APIs", "topic_distr": {"0": 0.0023112783674150705, "1": 0.0018862204160541296, "2": 0.001594117027707398, "3": 0.10377345234155655, "4": 0.001217760844156146, "5": 0.0010891960700973868, "6": 0.25518423318862915, "7": 0.5015157461166382, "8": 0.000827204727102071, "9": 0.0007658035610802472, "10": 0.0007128877914510667, "11": 0.0006668122950941324, "12": 0.0006263310206122696, "13": 0.0005904835416004062, "14": 0.12482554465532303, "15": 0.0005298344767652452, "16": 0.000503953720908612, "17": 0.0004804836062248796, "18": 0.0004591023316606879, "19": 0.00043954284046776593}}, {"key": "richardson2017learning", "year": "2017", "title": "Learning Technical Correspondences in Technical Documentation", "topic_distr": {"0": 0.002353269373998046, "1": 0.0019214922795072198, "2": 0.0016241632401943207, "3": 0.15779732167720795, "4": 0.0012407272588461637, "5": 0.001109738484956324, "6": 0.17935040593147278, "7": 0.06256681680679321, "8": 0.000842805951833725, "9": 0.0007802467443980277, "10": 0.0007263330044224858, "11": 0.0006793884676881135, "12": 0.0006381436833180487, "13": 0.0006016201805323362, "14": 0.000569051131606102, "15": 0.0005398272187449038, "16": 0.0005134583916515112, "17": 0.0004895456368103623, "18": 0.0004677610704675317, "19": 0.5851878523826599}}, {"key": "richardson2018polyglot", "year": "2018", "title": "Polyglot Semantic Parsing in APIs", "topic_distr": {"0": 0.13357385993003845, "1": 0.0020369498524814844, "2": 0.001721700420603156, "3": 0.14500494301319122, "4": 0.17728430032730103, "5": 0.0011763329384848475, "6": 0.0010640028631314635, "7": 0.06377604603767395, "8": 0.0008933820063248277, "9": 0.0008270686375908554, "10": 0.0007699195994064212, "11": 0.0007201579865068197, "12": 0.0006764381541870534, "13": 0.0006377228419296443, "14": 0.11167825758457184, "15": 0.0005722218193113804, "16": 0.0005442706169560552, "17": 0.0005189228104427457, "18": 0.0004958310164511204, "19": 0.35602760314941406}}, {"key": "richter2022can", "year": "2022", "title": "Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes", "topic_distr": {"0": 0.0011999360285699368, "1": 0.0009792180499061942, "2": 0.0008277232991531491, "3": 0.0007169073796831071, "4": 0.0006322866538539529, "5": 0.0005655335844494402, "6": 0.0005115296808071434, "7": 0.00046694077900610864, "8": 0.00042950204806402326, "9": 0.0003976212756242603, "10": 0.0003701463283505291, "11": 0.0003462229506112635, "12": 0.00032520422246307135, "13": 0.00030659144977107644, "14": 0.00028999397181905806, "15": 0.9906569123268127, "16": 0.00026166337192989886, "17": 0.00024947719066403806, "18": 0.00023837556364014745, "19": 0.0002282198693137616}}, {"key": "roziere2021dobf", "year": "2021", "title": "DOBF: A Deobfuscation Pre-Training Objective for Programming Languages", "topic_distr": {"0": 0.002836960833519697, "1": 0.0023150104098021984, "2": 0.001956484979018569, "3": 0.0016945431707426906, "4": 0.27644962072372437, "5": 0.001336742308922112, "6": 0.0012090945383533835, "7": 0.0011037004878744483, "8": 0.18801602721214294, "9": 0.0009398511028848588, "10": 0.21092753112316132, "11": 0.0008183616446331143, "12": 0.000768680009059608, "13": 0.0663767084479332, "14": 0.04159294068813324, "15": 0.0006502522737719119, "16": 0.000618489517364651, "17": 0.0005896852817386389, "18": 0.0005634445114992559, "19": 0.19923585653305054}}, {"key": "roziere2021leveraging", "year": "2021", "title": "Leveraging Automated Unit Tests for Unsupervised Code Translation", "topic_distr": {"0": 0.0021505607292056084, "1": 0.0017556967213749886, "2": 0.001484166830778122, "3": 0.0012854866217821836, "4": 0.0011337531032040715, "5": 0.0010140581289306283, "6": 0.0009172240388579667, "7": 0.0008372716256417334, "8": 0.0007701402646489441, "9": 0.0007129748119041324, "10": 0.0006637094775214791, "11": 0.0006208124104887247, "12": 0.0005831237649545074, "13": 0.0005497492384165525, "14": 0.9832748174667358, "15": 0.0004932840238325298, "16": 0.0004691886424552649, "17": 0.0004473376029636711, "18": 0.0004274312814231962, "19": 0.0004092211020179093}}, {"key": "russell2018automated", "year": "2018", "title": "Automated Vulnerability Detection in Source Code Using Deep Representation Learning", "topic_distr": {"0": 0.2368677407503128, "1": 0.0014974985970184207, "2": 0.0012659041676670313, "3": 0.26249101758003235, "4": 0.0009670337894931436, "5": 0.4897276759147644, "6": 0.0007823462947271764, "7": 0.0007141508394852281, "8": 0.0006568911485373974, "9": 0.0006081318715587258, "10": 0.0005661110044457018, "11": 0.0005295219598338008, "12": 0.0004973754403181374, "13": 0.00046890860539861023, "14": 0.000443523982539773, "15": 0.00042074659722857177, "16": 0.00040019446169026196, "17": 0.0003815566306002438, "18": 0.00036457754322327673, "19": 0.000349045149050653}}, {"key": "saberi2023model", "year": "2023", "title": "Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models", "topic_distr": {"0": 0.3418498635292053, "1": 0.0012124058557674289, "2": 0.0010247969767078757, "3": 0.0008876154315657914, "4": 0.09830556064844131, "5": 0.0007001928170211613, "6": 0.0006333302590064704, "7": 0.0005781241925433278, "8": 0.18998269736766815, "9": 0.040739718824625015, "10": 0.00045828201109543443, "11": 0.00042866222793236375, "12": 0.2939799726009369, "13": 0.0003795940720010549, "14": 0.0003590445267036557, "15": 0.02726960927248001, "16": 0.00032396812457591295, "17": 0.0003088802914135158, "18": 0.0002951352798845619, "19": 0.0002825613773893565}}, {"key": "sahu2022learning", "year": "2022", "title": "Learning to Answer Semantic Queries over Code", "topic_distr": {"0": 0.001710906159132719, "1": 0.48843494057655334, "2": 0.22127531468868256, "3": 0.0010213752975687385, "4": 0.0009008098277263343, "5": 0.000805706949904561, "6": 0.000728768645785749, "7": 0.0857672393321991, "8": 0.15024909377098083, "9": 0.044987063854932785, "10": 0.0005273418501019478, "11": 0.0004932585870847106, "12": 0.00046331348130479455, "13": 0.00043679619557224214, "14": 0.0004131500027142465, "15": 0.0003919324663002044, "16": 0.0003727878211066127, "17": 0.00035542636760510504, "18": 0.0003396100364625454, "19": 0.00032514138729311526}}, {"key": "saini2018oreo", "year": "2018", "title": "Oreo: detection of clones in the twilight zone", "topic_distr": {"0": 0.002446206286549568, "1": 0.0019964384846389294, "2": 0.0016878392780199647, "3": 0.9818494319915771, "4": 0.0012893734965473413, "5": 0.0011532505741342902, "6": 0.0010431244736537337, "7": 0.0009521975880488753, "8": 0.0008758516050875187, "9": 0.0008108395268209279, "10": 0.0007548118592239916, "11": 0.0007060266798362136, "12": 0.0006631647702306509, "13": 0.0006252091843634844, "14": 0.0005913631175644696, "15": 0.0005609933868981898, "16": 0.0005335906753316522, "17": 0.0005087402532808483, "18": 0.00048610157682560384, "19": 0.00046539181494154036}}, {"key": "santos2018syntax", "year": "2018", "title": "Syntax and Sensibility: Using language models to detect and correct syntax errors", "topic_distr": {"0": 0.0015222521033138037, "1": 0.0012421123683452606, "2": 0.0010498478077352047, "3": 0.0009092871914617717, "4": 0.0008019543020054698, "5": 0.0007172867772169411, "6": 0.0006487919017672539, "7": 0.675427258014679, "8": 0.0005447532166726887, "9": 0.0005043176934123039, "10": 0.17968957126140594, "11": 0.0004391272668726742, "12": 0.00041246842010878026, "13": 0.0003888611972797662, "14": 0.0003678099892567843, "15": 0.13409416377544403, "16": 0.0003318772651255131, "17": 0.00031642106478102505, "18": 0.000302340486086905, "19": 0.00028945962549187243}}, {"key": "saraiva2015products", "year": "2015", "title": "Products, Developers, and Milestones: How Should I Build My N-Gram Language Model", "topic_distr": {"0": 0.0017094507347792387, "1": 0.0013950266875326633, "2": 0.0011792290024459362, "3": 0.0010213485220447183, "4": 0.0009007881162688136, "5": 0.0008056880324147642, "6": 0.0007287515327334404, "7": 0.0006652278243564069, "8": 0.0006118907476775348, "9": 0.0005664717173203826, "10": 0.0005273295100778341, "11": 0.0004932470037601888, "12": 0.00046330265467986465, "13": 0.00043678595102392137, "14": 0.0004131403111387044, "15": 0.0003919232985936105, "16": 0.986670196056366, "17": 0.0003554180439095944, "18": 0.0003396020911168307, "19": 0.0003251337620895356}}, {"key": "sarkar2022what", "year": "2022", "title": "What is it like to program with artificial intelligence?", "topic_distr": {"0": 0.2655734121799469, "1": 0.0015669079730287194, "2": 0.0013243462890386581, "3": 0.0011470653116703033, "4": 0.0010116650955751538, "5": 0.000904859509319067, "6": 0.0008184530888684094, "7": 0.0007471102871932089, "8": 0.0006872079684399068, "9": 0.0006361983250826597, "10": 0.0005922380951233208, "11": 0.0005539604462683201, "12": 0.0005203302716836333, "13": 0.0004905496607534587, "14": 0.0004639934631995857, "15": 0.0004401648766361177, "16": 0.0004186642181593925, "17": 0.0003991661942563951, "18": 0.7213385701179504, "19": 0.00036515426472760737}}, {"key": "schrouff2019inferring", "year": "2019", "title": "Inferring Javascript types using Graph Neural Networks", "topic_distr": {"0": 0.0033707499969750643, "1": 0.0027523564640432596, "2": 0.0023266561329364777, "3": 0.002015150850638747, "4": 0.0017772749997675419, "5": 0.0015896433033049107, "6": 0.001437845639884472, "7": 0.0013125117402523756, "8": 0.0012072762474417686, "9": 0.38777193427085876, "10": 0.0010404346976429224, "11": 0.00097318907501176, "12": 0.5467190742492676, "13": 0.000861789972987026, "14": 0.0008151365327648818, "15": 0.041280727833509445, "16": 0.0007355028064921498, "17": 0.0007012489950284362, "18": 0.0006700436933897436, "19": 0.0006414973177015781}}, {"key": "schuster2021you", "year": "2021", "title": "You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion", "topic_distr": {"0": 0.002771488158032298, "1": 0.15145891904830933, "2": 0.0019129503052681684, "3": 0.001656860113143921, "4": 0.11428722739219666, "5": 0.0013070119312033057, "6": 0.0011822032975032926, "7": 0.0010791531531140208, "8": 0.0009926280472427607, "9": 0.11296264827251434, "10": 0.6045629382133484, "11": 0.0008001605165190995, "12": 0.0007515838369727135, "13": 0.0007085676770657301, "14": 0.0006702090031467378, "15": 0.0006357900565490127, "16": 0.0006047337665222585, "17": 0.0005765701062045991, "18": 0.0005509129841811955, "19": 0.0005274420254863799}}, {"key": "sharma2015nirmal", "year": "2015", "title": "NIRMAL: Automatic Identification of Software Relevant Tweets Leveraging Language Model", "topic_distr": {"0": 0.0013274378143250942, "1": 0.09216172993183136, "2": 0.21870802342891693, "3": 0.0007931945146992803, "4": 0.0006995658623054624, "5": 0.0006257095374166965, "6": 0.17077530920505524, "7": 0.0005166259361431003, "8": 0.00047520356019958854, "9": 0.00043993047438561916, "10": 0.34512677788734436, "11": 0.06236971542239189, "12": 0.000359807803761214, "13": 0.00033921454451046884, "14": 0.00032085098791867495, "15": 0.00030437353416346014, "16": 0.0002895058714784682, "17": 0.00027602299815043807, "18": 0.1038384661078453, "19": 0.00025250378530472517}}, {"key": "sharma2019feasibility", "year": "2019", "title": "On the Feasibility of Transfer-learning Code Smells using Deep Learning", "topic_distr": {"0": 0.35536882281303406, "1": 0.001008427469059825, "2": 0.000852326862514019, "3": 0.6094818115234375, "4": 0.0006510746898129582, "5": 0.027801329270005226, "6": 0.0005267295055091381, "7": 0.0004808156518265605, "8": 0.0004422644560690969, "9": 0.0004094363539479673, "10": 0.00038114498602226377, "11": 0.0003565107472240925, "12": 0.0003348674508742988, "13": 0.00031570164719596505, "14": 0.0002986109466291964, "15": 0.00028327564359642565, "16": 0.00026943854754790664, "17": 0.00025689025642350316, "18": 0.0002454587665852159, "19": 0.00023500128008890897}}, {"key": "sharma2022exploratory", "year": "2022", "title": "An Exploratory Study on Code Attention in BERT", "topic_distr": {"0": 0.2695702016353607, "1": 0.15591369569301605, "2": 0.0008198687573894858, "3": 0.3178298771381378, "4": 0.1404758244752884, "5": 0.020852498710155487, "6": 0.0005066761514171958, "7": 0.00046251030289568007, "8": 0.054824769496917725, "9": 0.00039384851697832346, "10": 0.0358533151447773, "11": 0.00034293788485229015, "12": 0.0003221185761503875, "13": 0.00030368243460543454, "14": 0.0002872424083761871, "15": 0.0002724909281823784, "16": 0.00025918064056895673, "17": 0.0002471100597176701, "18": 0.00023611378856003284, "19": 0.000226054442464374}}, {"key": "sharma2022lamner", "year": "2022", "title": "LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition", "topic_distr": {"0": 0.0012601392809301615, "1": 0.0010286866454407573, "2": 0.0008695366559550166, "3": 0.0007531332666985691, "4": 0.0006642293301410973, "5": 0.0005941031267866492, "6": 0.0005373712629079819, "7": 0.0004905297537334263, "8": 0.2261975109577179, "9": 0.04819336533546448, "10": 0.26325929164886475, "11": 0.0003637135087046772, "12": 0.19556227326393127, "13": 0.0003220798971597105, "14": 0.00030464393785223365, "15": 0.0002889987954404205, "16": 0.0002748821279965341, "17": 0.000262080313405022, "18": 0.0002504178846720606, "19": 0.25852301716804504}}, {"key": "she2019neuzz", "year": "2019", "title": "NEUZZ: Efficient Fuzzing with Neural Program Smoothing", "topic_distr": {"0": 0.0013860821491107345, "1": 0.39175713062286377, "2": 0.0009564918582327664, "3": 0.0008284486248157918, "4": 0.0007306601037271321, "5": 0.29984351992607117, "6": 0.0005911150947213173, "7": 0.0005395888583734632, "8": 0.0004963253159075975, "9": 0.0004594844067469239, "10": 0.00042773480527102947, "11": 0.0004000893677584827, "12": 0.000375800475012511, "13": 0.00035429190029390156, "14": 0.0003351120976731181, "15": 0.00031790227512829006, "16": 0.00030237375176511705, "17": 0.29935866594314575, "18": 0.00027546274941414595, "19": 0.000263727008132264}}, {"key": "shi2019learning", "year": "2019", "title": "Learning Execution through Neural Code Fusion", "topic_distr": {"0": 0.001642449526116252, "1": 0.0013401181204244494, "2": 0.0011326781241223216, "3": 0.0682598277926445, "4": 0.0008652519318275154, "5": 0.0007739036809653044, "6": 0.0007000023033469915, "7": 0.0006389846093952656, "8": 0.19464325904846191, "9": 0.4318496584892273, "10": 0.0005065263248980045, "11": 0.0004737884155474603, "12": 0.29464268684387207, "13": 0.00041955476626753807, "14": 0.0003968419332522899, "15": 0.0003764619177673012, "16": 0.0003580729535315186, "17": 0.00034139680792577565, "18": 0.0003262047830503434, "19": 0.00031230723834596574}}, {"key": "shi2022cv4code", "year": "2022", "title": "CV4Code: Sourcecode Understanding via Visual Code Representations", "topic_distr": {"0": 0.0016204604180529714, "1": 0.0013227687450125813, "2": 0.10504945367574692, "3": 0.0840747281908989, "4": 0.18903560936450958, "5": 0.000763862335588783, "6": 0.0006909198127686977, "7": 0.0006306938594207168, "8": 0.0005801256047561765, "9": 0.000537064450327307, "10": 0.0004999542143195868, "11": 0.0004676410462707281, "12": 0.6122287511825562, "13": 0.00041411106940358877, "14": 0.0003916929417755455, "15": 0.00037157736369408667, "16": 0.00035342699266038835, "17": 0.0003369672049302608, "18": 0.00032197232940234244, "19": 0.00030825508292764425}}, {"key": "shido2019automatic", "year": "2019", "title": "Automatic Source Code Summarization with Extended Tree-LSTM", "topic_distr": {"0": 0.0016636966029182076, "1": 0.19614596664905548, "2": 0.001147816306911409, "3": 0.0009941384196281433, "4": 0.0008767885155975819, "5": 0.000784222618676722, "6": 0.000709335959982127, "7": 0.0006475046975538135, "8": 0.0005955885862931609, "9": 0.0005513796349987388, "10": 0.000513280276209116, "11": 0.00048010580940172076, "12": 0.21682573854923248, "13": 0.00042514901724644005, "14": 0.4254133999347687, "15": 0.00038148160092532635, "16": 0.00036284743691794574, "17": 0.0003459489089436829, "18": 0.00033055435051210225, "19": 0.15080508589744568}}, {"key": "shirani2018evaluation", "year": "2018", "title": "Evaluation of Type Inference with Textual Cues", "topic_distr": {"0": 0.001891235588118434, "1": 0.0015431067440658808, "2": 0.12685444951057434, "3": 0.001129703945480287, "4": 0.000996340881101787, "5": 0.0008911534096114337, "6": 0.0008060555555857718, "7": 0.0007357934373430908, "8": 0.0006767984596081078, "9": 0.8599197268486023, "10": 0.0005832671886309981, "11": 0.0005455693462863564, "12": 0.0005124485469423234, "13": 0.00048311904538422823, "14": 0.0004569651500787586, "15": 0.00043349748011678457, "16": 0.00041232252260670066, "17": 0.0003931198443751782, "18": 0.00037562617217190564, "19": 0.00035962308174930513}}, {"key": "shrivastava2020on-the-fly", "year": "2020", "title": "On-the-Fly Adaptation of Source Code Models using Meta-Learning", "topic_distr": {"0": 0.001642700401134789, "1": 0.0013398657320067286, "2": 0.0011326716048642993, "3": 0.0009810490300878882, "4": 0.0599503293633461, "5": 0.0007738990825600922, "6": 0.0006999982288107276, "7": 0.0006389808841049671, "8": 0.0005877482471987605, "9": 0.4536955654621124, "10": 0.47510749101638794, "11": 0.00047378565068356693, "12": 0.00044502277160063386, "13": 0.0004195523215457797, "14": 0.0003968396340496838, "15": 0.00037645973498001695, "16": 0.00035807088715955615, "17": 0.00034139479976147413, "18": 0.0003262028913013637, "19": 0.0003123054339084774}}, {"key": "shrivastava2020repository", "year": "2022", "title": "Repository-Level Prompt Generation for Large Language Models of Code", "topic_distr": {"0": 0.4520576000213623, "1": 0.0016974852187559009, "2": 0.18262530863285065, "3": 0.001242674421519041, "4": 0.0010959904175251722, "5": 0.000980281736701727, "6": 0.0008866729913279414, "7": 0.0008093836368061602, "8": 0.0007444882648997009, "9": 0.0006892269011586905, "10": 0.11722277104854584, "11": 0.23617865145206451, "12": 0.0005637009744532406, "13": 0.0005314380396157503, "14": 0.0005026683793403208, "15": 0.00047685360186733305, "16": 0.00045356081682257354, "17": 0.0004324376059230417, "18": 0.0004131943278480321, "19": 0.000395590701373294}}, {"key": "shrivastava2023repofusion", "year": "2023", "title": "RepoFusion: Training Code Models to Understand Your Repository", "topic_distr": {"0": 0.2895601689815521, "1": 0.0011982235591858625, "2": 0.0010128132998943329, "3": 0.0008771998691372573, "4": 0.0007736473344266415, "5": 0.0006919694715179503, "6": 0.0006258920184336603, "7": 0.0005713343271054327, "8": 0.000525525480043143, "9": 0.19639746844768524, "10": 0.5046812891960144, "11": 0.00042362770182080567, "12": 0.0003979098401032388, "13": 0.00037513585994020104, "14": 0.0003548276727087796, "15": 0.00033660532790236175, "16": 0.00032016323530115187, "17": 0.0003052525862585753, "18": 0.00029166898457333446, "19": 0.000279242784017697}}, {"key": "shuai2020improving", "year": "2020", "title": "Improving Code Search with Co-Attentive Representation Learning", "topic_distr": {"0": 0.0016422284534201026, "1": 0.47895631194114685, "2": 0.19096927344799042, "3": 0.22283442318439484, "4": 0.0008652478572912514, "5": 0.0007738993735983968, "6": 0.0006999984034337103, "7": 0.0006389810587279499, "8": 0.0005877484218217432, "9": 0.000544121430721134, "10": 0.0005065235309302807, "11": 0.0004737857962027192, "12": 0.0004450228880159557, "13": 0.04991188645362854, "14": 0.00039683975046500564, "15": 0.0003764598513953388, "16": 0.0003580709744710475, "17": 0.048380691558122635, "18": 0.00032620300771668553, "19": 0.0003123055212199688}}, {"key": "si2018learning", "year": "2018", "title": "Learning Loop Invariants for Program Verification", "topic_distr": {"0": 0.002045258181169629, "1": 0.0016696980455890298, "2": 0.0014111887430772185, "3": 0.0012222904479131103, "4": 0.00107801693957299, "5": 0.0009642060031183064, "6": 0.0008721323101781309, "7": 0.0007961104274727404, "8": 0.12840205430984497, "9": 0.0006779241957701743, "10": 0.0006310807657428086, "11": 0.00059029262047261, "12": 0.057149212807416916, "13": 0.0005227229557931423, "14": 0.0004944250686094165, "15": 0.0004690336063504219, "16": 0.000446122809080407, "17": 0.0004253459919709712, "18": 0.3941130042076111, "19": 0.4060198962688446}}, {"key": "silavong2022senatus", "year": "2022", "title": "Senatus - A Fast and Accurate Code-to-Code Recommendation Engine", "topic_distr": {"0": 0.001621332485228777, "1": 0.20271891355514526, "2": 0.0011179788270965219, "3": 0.00096831627888605, "4": 0.0008540163980796933, "5": 0.0007638537208549678, "6": 0.16840343177318573, "7": 0.0006306866998784244, "8": 0.22298403084278107, "9": 0.000537058396730572, "10": 0.0004999485681764781, "11": 0.00046763577847741544, "12": 0.00043924624333158135, "13": 0.00041410638368688524, "14": 0.00039168851799331605, "15": 0.0003715731727425009, "16": 0.0003534230054356158, "17": 0.1834438145160675, "18": 0.00032196869142353535, "19": 0.21269701421260834}}, {"key": "silva2023repairllama", "year": "2023", "title": "RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair", "topic_distr": {"0": 0.0019509685225784779, "1": 0.0015911402879282832, "2": 0.0013450086116790771, "3": 0.10390612483024597, "4": 0.0010274728992953897, "5": 0.0009189980919472873, "6": 0.0008312414865940809, "7": 0.0007587839500047266, "8": 0.0006979455938562751, "9": 0.0006461389712058008, "10": 0.000601491890847683, "11": 0.0005626161000691354, "12": 0.0005284604267217219, "13": 0.000498214503750205, "14": 0.0004712434019893408, "15": 0.882075309753418, "16": 0.0004252058861311525, "17": 0.00040540320333093405, "18": 0.00038736293208785355, "19": 0.0003708598087541759}}, {"key": "singh2016question", "year": "2016", "title": "Question Independent Grading using Machine Learning: The Case of Computer Program Grading", "topic_distr": {"0": 0.17527763545513153, "1": 0.0011571778450161219, "2": 0.3934629559516907, "3": 0.0008472775225527585, "4": 0.11671004444360733, "5": 0.000668373191729188, "6": 0.0006045490736141801, "7": 0.0005518518155440688, "8": 0.07906956225633621, "9": 0.00046992689021863043, "10": 0.00043745574657805264, "11": 0.00040918198646977544, "12": 0.00038434111047536135, "13": 0.0003623437078204006, "14": 0.0003427280462346971, "15": 0.00032512706820853055, "16": 0.00030924566090106964, "17": 0.0002948434848804027, "18": 0.22804571688175201, "19": 0.00026972059276886284}}, {"key": "siow2019core", "year": "2019", "title": "CORE: Automating Review Recommendation for Code Changes", "topic_distr": {"0": 0.0012733639450743794, "1": 0.001039261231198907, "2": 0.32264918088912964, "3": 0.04198603332042694, "4": 0.0006710141897201538, "5": 0.08860059827566147, "6": 0.1352016180753708, "7": 0.0004955407348461449, "8": 0.000455808883998543, "9": 0.000421975419158116, "10": 0.3068738877773285, "11": 0.09802387654781342, "12": 0.00034512285492382944, "13": 0.0003253700560890138, "14": 0.0003077559813391417, "15": 0.00029195102979429066, "16": 0.00027769015287049115, "17": 0.00026475757476873696, "18": 0.0002529759658500552, "19": 0.00024219824990723282}}, {"key": "siow2022learning", "year": "2022", "title": "Learning Program Semantics with Code Representations: An Empirical Study", "topic_distr": {"0": 0.33057841658592224, "1": 0.0009012005757540464, "2": 0.0007617924711667001, "3": 0.3294006586074829, "4": 0.09080317616462708, "5": 0.0005204955814406276, "6": 0.0004707924381364137, "7": 0.00042975449468940496, "8": 0.0003952973347622901, "9": 0.00036595549318008125, "10": 0.00034066857188008726, "11": 0.00031865041819401085, "12": 0.243011012673378, "13": 0.0002821751113515347, "14": 0.00026689941296353936, "15": 0.0002531926438678056, "16": 0.00024082500021904707, "17": 0.0002296093007316813, "18": 0.00021939180442132056, "19": 0.00021004487643949687}}, {"key": "sivaraman2021mining", "year": "2021", "title": "Mining Idioms in the Wild", "topic_distr": {"0": 0.0017570063937455416, "1": 0.0014345420058816671, "2": 0.0012124869972467422, "3": 0.0010501513024792075, "4": 0.0009261893574148417, "5": 0.2595670521259308, "6": 0.0007493009907193482, "7": 0.0006839859997853637, "8": 0.5596887469291687, "9": 0.05023985728621483, "10": 0.0005421991809271276, "11": 0.0005071556661278009, "12": 0.11893222481012344, "13": 0.0004491024883463979, "14": 0.00042479007970541716, "15": 0.00040297480882145464, "16": 0.00038329075323417783, "17": 0.00036544015165418386, "18": 0.0003491782408673316, "19": 0.00033430190524086356}}, {"key": "souza2023lexecutor", "year": "2023", "title": "LExecutor: Learning-Guided Execution", "topic_distr": {"0": 0.0023106043227016926, "1": 0.14781349897384644, "2": 0.0015940795419737697, "3": 0.0013807222712785006, "4": 0.0012177475728094578, "5": 0.29747650027275085, "6": 0.000985174672678113, "7": 0.0008992991643026471, "8": 0.0008271944825537503, "9": 0.3518722355365753, "10": 0.0007128790020942688, "11": 0.0006668040296062827, "12": 0.0006263232789933681, "13": 0.000590476265642792, "14": 0.0005585104809142649, "15": 0.1498282253742218, "16": 0.0005039474926888943, "17": 0.00048047766904346645, "18": 0.03921578451991081, "19": 0.0004395373980514705}}, {"key": "spirin2021psiminer", "year": "2021", "title": "PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code", "topic_distr": {"0": 0.0015801831614226103, "1": 0.3737042248249054, "2": 0.0010896573076024652, "3": 0.0009438112610951066, "4": 0.0008323973743245006, "5": 0.0007445185910910368, "6": 0.0006734231137670577, "7": 0.0006147222593426704, "8": 0.0746506080031395, "9": 0.12777192890644073, "10": 0.00048729346599429846, "11": 0.0004557986103463918, "12": 0.3091566562652588, "13": 0.10526369512081146, "14": 0.00038177380338311195, "15": 0.00036216762964613736, "16": 0.00034447689540684223, "17": 0.00032843390363268554, "18": 0.0003138187457807362, "19": 0.0003004488826263696}}, {"key": "srikant2014system", "year": "2014", "title": "A system to grade computer programming skills using machine learning", "topic_distr": {"0": 0.0013005695072934031, "1": 0.0010608381126075983, "2": 0.0008967011817730963, "3": 0.0007766580674797297, "4": 0.0006849782075732946, "5": 0.0006126626394689083, "6": 0.0005541584687307477, "7": 0.0005058536771684885, "8": 0.0004652949864976108, "9": 0.00043075738358311355, "10": 0.00040099277975969017, "11": 0.00037507573142647743, "12": 0.00035230538924224675, "13": 0.0003321415279060602, "14": 0.0003141608613077551, "15": 0.00029802697827108204, "16": 0.00028346930048428476, "17": 0.00027026759926229715, "18": 0.9898378252983093, "19": 0.00024723875685594976}}, {"key": "sun2019grammar", "year": "2019", "title": "A Grammar-Based Structural CNN Decoder for Code Generation", "topic_distr": {"0": 0.0020136984530836344, "1": 0.1982671320438385, "2": 0.001388509408570826, "3": 0.00120257749222219, "4": 0.19753305613994598, "5": 0.0009486498311161995, "6": 0.0008580616558901966, "7": 0.0007832662668079138, "8": 0.15640753507614136, "9": 0.000666986801661551, "10": 0.0006208991399034858, "11": 0.0005807690322399139, "12": 0.26265156269073486, "13": 0.0005142894806340337, "14": 0.0004864481743425131, "15": 0.00046146640670485795, "16": 0.0004389252280816436, "17": 0.000418483599787578, "18": 0.00039986128103919327, "19": 0.17335781455039978}}, {"key": "sun2020pscs", "year": "2020", "title": "PSCS: A Path-based Neural Model for Semantic Code Search", "topic_distr": {"0": 0.0015799397369846702, "1": 0.31435227394104004, "2": 0.0010897045722231269, "3": 0.1668003350496292, "4": 0.0008323938236571848, "5": 0.0007445146911777556, "6": 0.0006734197377227247, "7": 0.19310371577739716, "8": 0.0005654317792505026, "9": 0.0005234613199718297, "10": 0.0004872910212725401, "11": 0.0004557963111437857, "12": 0.2927492558956146, "13": 0.02401137724518776, "14": 0.0003817718825303018, "15": 0.0003621657961048186, "16": 0.00034447514917701483, "17": 0.00032843227381817997, "18": 0.00031381717417389154, "19": 0.00030044736922718585}}, {"key": "svyatkovskiy2019pythia", "year": "2019", "title": "Pythia: AI-assisted Code Completion System", "topic_distr": {"0": 0.0024953398387879133, "1": 0.0020365617237985134, "2": 0.001721667591482401, "3": 0.0014911857433617115, "4": 0.0013151643797755241, "5": 0.0011763180373236537, "6": 0.0010639894753694534, "7": 0.03722454607486725, "8": 0.0008933706558309495, "9": 0.0008270581602118909, "10": 0.5095129013061523, "11": 0.0007201488479040563, "12": 0.07817923277616501, "13": 0.0006377148092724383, "14": 0.20322147011756897, "15": 0.0005722145433537662, "16": 0.15542161464691162, "17": 0.0005189162329770625, "18": 0.0004958247300237417, "19": 0.0004747007042169571}}, {"key": "svyatkovskiy2020fast", "year": "2020", "title": "Fast and Memory-Efficient Neural Code Completion", "topic_distr": {"0": 0.002046122681349516, "1": 0.22541090846061707, "2": 0.0014112673234194517, "3": 0.0012223124504089355, "4": 0.165326789021492, "5": 0.0009642146760597825, "6": 0.0008721397607587278, "7": 0.0007961171795614064, "8": 0.0007322855526581407, "9": 0.0006779299583286047, "10": 0.5556365251541138, "11": 0.0005902976263314486, "12": 0.0005544614396058023, "13": 0.04112813249230385, "14": 0.0004944292595610023, "15": 0.0004690375935751945, "16": 0.00044612662168219686, "17": 0.00042534960084594786, "18": 0.00040642175008542836, "19": 0.0003891066589858383}}, {"key": "svyatkovskiy2020intellicode", "year": "2020", "title": "IntelliCode Compose: Code Generation Using Transformer", "topic_distr": {"0": 0.002310337731614709, "1": 0.0018859395058825612, "2": 0.15134701132774353, "3": 0.028402196243405342, "4": 0.07270681858062744, "5": 0.0010892022401094437, "6": 0.0009851924842223525, "7": 0.06435693055391312, "8": 0.0008272093837149441, "9": 0.06706973165273666, "10": 0.2693024277687073, "11": 0.0006668160203844309, "12": 0.0006263345130719244, "13": 0.0005904868594370782, "14": 0.20362702012062073, "15": 0.0005298374453559518, "16": 0.05678366869688034, "17": 0.07599417120218277, "18": 0.00045910492190159857, "19": 0.00043954531429335475}}, {"key": "szafraniec2022code", "year": "2022", "title": "Code Translation with Compiler Representations", "topic_distr": {"0": 0.0018343649571761489, "1": 0.0014975358499214053, "2": 0.001265917788259685, "3": 0.001096464809961617, "4": 0.0009670331492088735, "5": 0.0008649391238577664, "6": 0.0007823447231203318, "7": 0.0007141493842937052, "8": 0.0006568898097611964, "9": 0.0006081306491978467, "10": 0.0005661098402924836, "11": 0.0005295209120959044, "12": 0.000497374392580241, "13": 0.0004689076740760356, "14": 0.7695752382278442, "15": 0.00042074575321748853, "16": 0.00040019367588683963, "17": 0.21654047071933746, "18": 0.00036457678652368486, "19": 0.000349044450558722}}, {"key": "tabassum2020code", "year": "2020", "title": "Code and Named Entity Recognition in StackOverflow", "topic_distr": {"0": 0.3957329988479614, "1": 0.0021667461842298508, "2": 0.0018316117348149419, "3": 0.001586408820003271, "4": 0.0013991388259455562, "5": 0.0012514253612607718, "6": 0.3647553324699402, "7": 0.0010332573438063264, "8": 0.0009504120680503547, "9": 0.08613239973783493, "10": 0.0008190682856366038, "11": 0.0007661301060579717, "12": 0.0007196193910203874, "13": 0.0006784326396882534, "14": 0.000641705293674022, "15": 0.0006087502697482705, "16": 0.000579014711547643, "17": 0.0005520488484762609, "18": 0.13729050755500793, "19": 0.0005050101899541914}}, {"key": "tan2024llm4decompile", "year": "2024", "title": "LLM4Decompile: Decompiling Binary Code with Large Language Models", "topic_distr": {"0": 0.6594117283821106, "1": 0.001885885838419199, "2": 0.0015941642923280597, "3": 0.0013807530049234629, "4": 0.0012177687603980303, "5": 0.0010892029386013746, "6": 0.0009851932991296053, "7": 0.0008993161027319729, "8": 0.0008272100822068751, "9": 0.0007658085087314248, "10": 0.0007128924480639398, "11": 0.00066681660246104, "12": 0.11904137581586838, "13": 0.02757655829191208, "14": 0.02209414355456829, "15": 0.0005298379110172391, "16": 0.0005039570387452841, "17": 0.00048048674943856895, "18": 0.15789732336997986, "19": 0.0004395456926431507}}, {"key": "tarlow2019learning", "year": "2019", "title": "Learning to Fix Build Errors with Graph2Diff Neural Networks", "topic_distr": {"0": 0.0022699618712067604, "1": 0.001851815264672041, "2": 0.0015651629073545337, "3": 0.0013556446647271514, "4": 0.0011956164380535483, "5": 0.0010693909134715796, "6": 0.0009672731393948197, "7": 0.09674162417650223, "8": 0.0008121635182760656, "9": 0.0007518788333982229, "10": 0.000699925294611603, "11": 0.14764030277729034, "12": 0.2470110058784485, "13": 0.0005797466728836298, "14": 0.0005483618006110191, "15": 0.2063165307044983, "16": 0.0004947902634739876, "17": 0.2872465252876282, "18": 0.0004507543926592916, "19": 0.0004315505502745509}}, {"key": "theeten2019import2vec", "year": "2019", "title": "Import2vec - Learning Embeddings for Software Libraries", "topic_distr": {"0": 0.0016428640810772777, "1": 0.00134029530454427, "2": 0.0011326895328238606, "3": 0.24361568689346313, "4": 0.0008652409887872636, "5": 0.0007738923304714262, "6": 0.7449023723602295, "7": 0.0006389752961695194, "8": 0.0005877430667169392, "9": 0.0005441164830699563, "10": 0.0005065189907327294, "11": 0.00047378151793964207, "12": 0.00044501887168735266, "13": 0.00041954865446314216, "14": 0.000396836141590029, "15": 0.00037645644624717534, "16": 0.0003580677439458668, "17": 0.00034139183117076755, "18": 0.00032620003912597895, "19": 0.0003123026981484145}}, {"key": "tian2020evaluating", "year": "2020", "title": "Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair", "topic_distr": {"0": 0.001758308382704854, "1": 0.10558831691741943, "2": 0.0012125199427828193, "3": 0.3043586015701294, "4": 0.0009261871455237269, "5": 0.08776921033859253, "6": 0.0007492999429814517, "7": 0.0006839850684627891, "8": 0.05443520098924637, "9": 0.0005824443651363254, "10": 0.0005421984242275357, "11": 0.0005071549094282091, "12": 0.0004763662291225046, "13": 0.0004491018771659583, "14": 0.000424789497628808, "15": 0.3007489740848541, "16": 0.0003832902293652296, "17": 0.13772058486938477, "18": 0.00034917774610221386, "19": 0.00033430143957957625}}, {"key": "tian2024debugbench", "year": "2024", "title": "DebugBench: Evaluating Debugging Capability of Large Language Models", "topic_distr": {"0": 0.6751835346221924, "1": 0.0016971523873507977, "2": 0.0014347097603604198, "3": 0.0012426689499989152, "4": 0.0010959840146824718, "5": 0.0009802754502743483, "6": 0.0008866671123541892, "7": 0.0008093783399090171, "8": 0.0007444833754561841, "9": 0.026514803990721703, "10": 0.000641598307993263, "11": 0.000600130355451256, "12": 0.0005636972491629422, "13": 0.0005314346053637564, "14": 0.0005026650615036488, "15": 0.182308167219162, "16": 0.00045355784823186696, "17": 0.0004324347828514874, "18": 0.00041319162119179964, "19": 0.1029634177684784}}, {"key": "tomczak2019simulating", "year": "2019", "title": "Simulating Execution Time of Tensor Programs using Graph Neural Networks", "topic_distr": {"0": 0.0026568453758955, "1": 0.0021668614353984594, "2": 0.00183169636875391, "3": 0.0015863855369389057, "4": 0.0013991249725222588, "5": 0.0012514143018051982, "6": 0.001131914439611137, "7": 0.0010332479141652584, "8": 0.27659666538238525, "9": 0.0008798575145192444, "10": 0.0008190608350560069, "11": 0.0007661231211386621, "12": 0.4154609143733978, "13": 0.0006784264696761966, "14": 0.0006416994729079306, "15": 0.2889362573623657, "16": 0.0005790094728581607, "17": 0.0005520438426174223, "18": 0.0005274781724438071, "19": 0.0005050055915489793}}, {"key": "tran2019recovering", "year": "2019", "title": "Recovering Variable Names for Minified Code with Usage Contexts", "topic_distr": {"0": 0.001834674272686243, "1": 0.0014976527309045196, "2": 0.001265907078050077, "3": 0.001096460036933422, "4": 0.0009670377476140857, "5": 0.0008649405790492892, "6": 0.0007823460036888719, "7": 0.0007141506066545844, "8": 0.0006568909157067537, "9": 0.44799116253852844, "10": 0.0005661108298227191, "11": 0.000529521785210818, "12": 0.0004973752656951547, "13": 0.5383760929107666, "14": 0.0004435238370206207, "15": 0.0004207464517094195, "16": 0.00040019434527494013, "17": 0.0003815564850810915, "18": 0.00036457739770412445, "19": 0.00034904503263533115}}, {"key": "tu2014localness", "year": "2014", "title": "On the Localness of Software", "topic_distr": {"0": 0.00319649581797421, "1": 0.002610984491184354, "2": 0.0022072074934840202, "3": 0.001911756582558155, "4": 0.0016860971227288246, "5": 0.0015080892480909824, "6": 0.0013640793040394783, "7": 0.001245175488293171, "8": 0.00114533887244761, "9": 0.0010603234404698014, "10": 0.0009870568756014109, "11": 0.0009232612210325897, "12": 0.0008672112599015236, "13": 0.0008175772964023054, "14": 0.0007733172969892621, "15": 0.0007336031994782388, "16": 0.9750529527664185, "17": 0.0006652725278399885, "18": 0.0006356682279147208, "19": 0.0006085863569751382}}, {"key": "tufano2018deep", "year": "2018", "title": "Deep Learning Similarities from Different Representations of Source Code", "topic_distr": {"0": 0.001782295759767294, "1": 0.0014546335441991687, "2": 0.0012297285720705986, "3": 0.9867758750915527, "4": 0.0009394002263434231, "5": 0.0008402231615036726, "6": 0.0007599889067932963, "7": 0.0006937423022463918, "8": 0.0006381189450621605, "9": 0.0005907531012780964, "10": 0.0005499330582097173, "11": 0.0005143896560184658, "12": 0.0004831617698073387, "13": 0.00045550844515673816, "14": 0.00043084926437586546, "15": 0.0004087227862328291, "16": 0.0003887579950969666, "17": 0.0003706527641043067, "18": 0.0003541588957887143, "19": 0.0003390703641343862}}, {"key": "tufano2018empirical", "year": "2018", "title": "An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation", "topic_distr": {"0": 0.24729779362678528, "1": 0.0014760082121938467, "2": 0.0012476576957851648, "3": 0.001080583082512021, "4": 0.0009530318784527481, "5": 0.1503896266222, "6": 0.0007710178615525365, "7": 0.0007038098992779851, "8": 0.2391417771577835, "9": 0.0005993261002004147, "10": 0.0005579136777669191, "11": 0.0005218544392846525, "12": 0.0004901733482256532, "13": 0.00046211876906454563, "14": 0.15943486988544464, "15": 0.19339872896671295, "16": 0.00039439962711185217, "17": 0.00037603164673782885, "18": 0.00035929842852056026, "19": 0.00034399094874970615}}, {"key": "tufano2018learning", "year": "2018", "title": "Learning How to Mutate Source Code from Bug-Fixes", "topic_distr": {"0": 0.002044553868472576, "1": 0.0016692178323864937, "2": 0.3974129259586334, "3": 0.0012222772929817438, "4": 0.0010780011070892215, "5": 0.0009641924407333136, "6": 0.0008721199701540172, "7": 0.0007960991933941841, "8": 0.0007322690216824412, "9": 0.0006779146497137845, "10": 0.0006310718599706888, "11": 0.3588770925998688, "12": 0.0005544489249587059, "13": 0.0005227155634202063, "14": 0.0004944180836901069, "15": 0.22978371381759644, "16": 0.0004461165517568588, "17": 0.00042533999658189714, "18": 0.0004064125823788345, "19": 0.00038909786962904036}}, {"key": "tufano2019learning", "year": "2019", "title": "On Learning Meaningful Code Changes via Neural Machine Translation", "topic_distr": {"0": 0.0014858735958114266, "1": 0.0012124063214287162, "2": 0.001024827011860907, "3": 0.0008876274223439395, "4": 0.000782847695518285, "5": 0.0007001982303336263, "6": 0.0006333349156193435, "7": 0.0005781284417025745, "8": 0.0005317748873494565, "9": 0.0004923026426695287, "10": 0.00045828535803593695, "11": 0.9885199666023254, "12": 0.0004026416572742164, "13": 0.00037959686596877873, "14": 0.0003590471751522273, "15": 0.0003406081523280591, "16": 0.0003239705110900104, "17": 0.00030888256151229143, "18": 0.0002951374335680157, "19": 0.0002825634728651494}}, {"key": "tufano2020generating", "year": "2020", "title": "Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers", "topic_distr": {"0": 0.0014519604155793786, "1": 0.0011840123916044831, "2": 0.6871699690818787, "3": 0.00086697016377002, "4": 0.05896596610546112, "5": 0.0006839105626568198, "6": 0.24461613595485687, "7": 0.0005646805511787534, "8": 0.0005194051773287356, "9": 0.00048085112939588726, "10": 0.00044762511970475316, "11": 0.000418694136897102, "12": 0.00039327575359493494, "13": 0.0003707669966388494, "14": 0.0003506953362375498, "15": 0.00033268521656282246, "16": 0.00031643459806218743, "17": 0.0003016976115759462, "18": 0.0002882721892092377, "19": 0.00027599072200246155}}, {"key": "tufano2020unit", "year": "2020", "title": "Unit Test Case Generation with Transformers", "topic_distr": {"0": 0.0011443205876275897, "1": 0.0009341819095425308, "2": 0.9916133880615234, "3": 0.000684026163071394, "4": 0.0006032859091646969, "5": 0.0005395946791395545, "6": 0.0004880679480265826, "7": 0.00044552411418408155, "8": 0.00040980256744660437, "9": 0.00037938402965664864, "10": 0.00035316922003403306, "11": 0.0003303431149106473, "12": 0.00031028842204250395, "13": 0.0002925293520092964, "14": 0.00027669311384670436, "15": 0.00026248343056067824, "16": 0.0002496619417797774, "17": 0.00023803468502592295, "18": 0.00022744225861970335, "19": 0.0002177523565478623}}, {"key": "vaithilingam2022expectation", "year": "2022", "title": "Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models", "topic_distr": {"0": 0.42662355303764343, "1": 0.1132374033331871, "2": 0.0014590417267754674, "3": 0.001263724989257753, "4": 0.2774912118911743, "5": 0.0009968893136829138, "6": 0.00090169464237988, "7": 0.0008230958483181894, "8": 0.0007571010501123965, "9": 0.0007009035325609148, "10": 0.056267041712999344, "11": 0.0006103015039116144, "12": 0.0005732509307563305, "13": 0.0005404414841905236, "14": 0.0005111843929626048, "15": 0.00048493227222934365, "16": 0.0004612448683474213, "17": 0.000439763767644763, "18": 0.11545491963624954, "19": 0.0004022926150355488}}, {"key": "vasic2019neural", "year": "2019", "title": "Neural Program Repair by Jointly Learning to Localize and Repair", "topic_distr": {"0": 0.002044500084593892, "1": 0.25139859318733215, "2": 0.001411193166859448, "3": 0.0012223043013364077, "4": 0.0010780165903270245, "5": 0.14294728636741638, "6": 0.0008721327758394182, "7": 0.0007961108349263668, "8": 0.0007322796736843884, "9": 0.0006779245450161397, "10": 0.0006310810567811131, "11": 0.0005902929115109146, "12": 0.0005544570158235729, "13": 0.000522723188623786, "14": 0.0004944253014400601, "15": 0.5262203812599182, "16": 0.06658540666103363, "17": 0.0004253461956977844, "18": 0.0004064184904564172, "19": 0.0003891035448759794}}, {"key": "vasilescu2017recovering", "year": "2017", "title": "Recovering Clear, Natural Identifiers from Obfuscated JS Names", "topic_distr": {"0": 0.19321170449256897, "1": 0.0024834114592522383, "2": 0.002099518897011876, "3": 0.0018185055814683437, "4": 0.0016038522589951754, "5": 0.0014345287345349789, "6": 0.0012975431745871902, "7": 0.0011844390537589788, "8": 0.0010894723236560822, "9": 0.366250604391098, "10": 0.0009389108745381236, "11": 0.0008782270015217364, "12": 0.000824910996016115, "13": 0.3471147418022156, "14": 0.07459170371294022, "15": 0.0006978199817240238, "16": 0.0006637336919084191, "17": 0.0006328222807496786, "18": 0.0006046619964763522, "19": 0.0005789010901935399}}, {"key": "villmow2021contest", "year": "2021", "title": "ConTest: A Unit Test Completion Benchmark featuring Context", "topic_distr": {"0": 0.2421063333749771, "1": 0.0022134508471935987, "2": 0.4212610125541687, "3": 0.0016208747401833534, "4": 0.00142953812610358, "5": 0.0012786147417500615, "6": 0.0011565177701413631, "7": 0.001055706525221467, "8": 0.0009710613521747291, "9": 0.0008989821071736515, "10": 0.32030850648880005, "11": 0.0007827755762264132, "12": 0.0007352543179877102, "13": 0.0006931727402843535, "14": 0.0006556474254466593, "15": 0.0006219763890840113, "16": 0.0005915947840549052, "17": 0.0005640430608764291, "18": 0.0005389433936215937, "19": 0.000515982392244041}}, {"key": "wan2018improving", "year": "2018", "title": "Improving Automatic Source Code Summarization via Deep Reinforcement Learning", "topic_distr": {"0": 0.14788036048412323, "1": 0.0013057463802397251, "2": 0.06741087883710861, "3": 0.0009559001191519201, "4": 0.0008430648013018072, "5": 0.0007540592923760414, "6": 0.0006820530397817492, "7": 0.18544568121433258, "8": 0.0005726806703023612, "9": 0.0005301721394062042, "10": 0.0004935381002724171, "11": 0.0004616396618075669, "12": 0.21748587489128113, "13": 0.00040879662265069783, "14": 0.0003866662154905498, "15": 0.0003668087883852422, "16": 0.00034889133530668914, "17": 0.00033264278317801654, "18": 0.00031784034217707813, "19": 0.37301671504974365}}, {"key": "wan2019multimodal", "year": "2019", "title": "Multi-Modal Attention Network Learning for Semantic Source Code Retrieval", "topic_distr": {"0": 0.0013713166117668152, "1": 0.2312750369310379, "2": 0.08205465972423553, "3": 0.22887249290943146, "4": 0.0007226351881399751, "5": 0.0006463436293415725, "6": 0.0005846232525072992, "7": 0.0005336628528311849, "8": 0.0004908744595013559, "9": 0.00045443818089552224, "10": 0.0004230372724123299, "11": 0.0003956954169552773, "12": 0.2041761428117752, "13": 0.02694624476134777, "14": 0.00033143177279271185, "15": 0.0003144109505228698, "16": 0.0002990529756061733, "17": 0.00028512548306025565, "18": 0.0002724375226534903, "19": 0.21955034136772156}}, {"key": "wan2020naturalcc", "year": "2020", "title": "NaturalCC: A Toolkit to Naturalize the Source Code Corpus", "topic_distr": {"0": 0.3190641403198242, "1": 0.002368865767493844, "2": 0.002001980086788535, "3": 0.0017339737387374043, "4": 0.0015293015167117119, "5": 0.0013678456889465451, "6": 0.001237227814272046, "7": 0.0011293812422081828, "8": 0.0010388288646936417, "9": 0.15824584662914276, "10": 0.033977631479501724, "11": 0.0008374031749553978, "12": 0.00078656553523615, "13": 0.0007415472064167261, "14": 0.000701403187122196, "15": 0.0006653822492808104, "16": 0.0006328804884105921, "17": 0.0006034059915691614, "18": 0.06266684085130692, "19": 0.4086695611476898}}, {"key": "wan2022what", "year": "2022", "title": "What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code", "topic_distr": {"0": 0.14605864882469177, "1": 0.12932702898979187, "2": 0.0012848314363509417, "3": 0.001112836180254817, "4": 0.1882133036851883, "5": 0.0008778533083386719, "6": 0.0007940253708511591, "7": 0.0007248119218274951, "8": 0.0006666974513791502, "9": 0.0006172102876007557, "10": 0.06136214733123779, "11": 0.0005374268512241542, "12": 0.465552419424057, "13": 0.00047590862959623337, "14": 0.00045014507486484945, "15": 0.0004270276695024222, "16": 0.0004061687213834375, "17": 0.00038725262857042253, "18": 0.00037002007593400776, "19": 0.0003542558115441352}}, {"key": "wang2016automatically", "year": "2016", "title": "Automatically Learning Semantic Features for Defect Prediction", "topic_distr": {"0": 0.0012729291338473558, "1": 0.0010392534313723445, "2": 0.0008783906232565641, "3": 0.314773291349411, "4": 0.0006710032466799021, "5": 0.5902618169784546, "6": 0.0005428526201285422, "7": 0.0004955333424732089, "8": 0.0004558021028060466, "9": 0.0004219691618345678, "10": 0.00039281180943362415, "11": 0.0003674234903883189, "12": 0.0864642858505249, "13": 0.000325365224853158, "14": 0.00030775141203776, "15": 0.0002919466933235526, "16": 0.0002776860201265663, "17": 0.00026475361664779484, "18": 0.0002529722114559263, "19": 0.00024219464103225619}}, {"key": "wang2016bugram", "year": "2016", "title": "Bugram: bug detection with n-gram language models", "topic_distr": {"0": 0.0012723684776574373, "1": 0.0010391934774816036, "2": 0.0008784057572484016, "3": 0.0007608255837112665, "4": 0.0006710044108331203, "5": 0.8867053985595703, "6": 0.0005428532604128122, "7": 0.0004955338663421571, "8": 0.104144386947155, "9": 0.0004219696274958551, "10": 0.00039281221688725054, "11": 0.0003674238978419453, "12": 0.00034511808189563453, "13": 0.0003253655740991235, "14": 0.00030775173217989504, "15": 0.00029194701346568763, "16": 0.00027768631116487086, "17": 0.0002647539076860994, "18": 0.00025297250249423087, "19": 0.0002421949029667303}}, {"key": "wang2016neural", "year": "2016", "title": "Neural Code Completion", "topic_distr": {"0": 0.002045680768787861, "1": 0.4498971402645111, "2": 0.0014111693017184734, "3": 0.0012222824152559042, "4": 0.0010780016891658306, "5": 0.0009641918004490435, "6": 0.13061444461345673, "7": 0.0007960986695252359, "8": 0.000732268497813493, "9": 0.0006779141840524971, "10": 0.27904990315437317, "11": 0.000590283889323473, "12": 0.12776745855808258, "13": 0.0005227152141742408, "14": 0.0004944177344441414, "15": 0.00046902670874260366, "16": 0.0004461162316147238, "17": 0.00042533970554359257, "18": 0.0004064122913405299, "19": 0.00038909760769456625}}, {"key": "wang2019learning", "year": "2019", "title": "Learning Scalable and Precise Representation of Program Semantics", "topic_distr": {"0": 0.0014683807967230678, "1": 0.001198125653900206, "2": 0.0010127800051122904, "3": 0.16570699214935303, "4": 0.000773636915255338, "5": 0.0006919611478224397, "6": 0.0006258843932300806, "7": 0.000571327400393784, "8": 0.0005255190772004426, "9": 0.10306714475154877, "10": 0.00045289413537830114, "11": 0.0004236225795466453, "12": 0.4286749064922333, "13": 0.0003751313197426498, "14": 0.000354823365341872, "15": 0.0003366012533660978, "16": 0.0003201593644917011, "17": 0.29284918308258057, "18": 0.0002916654630098492, "19": 0.000279239407973364}}, {"key": "wang2020blended", "year": "2020", "title": "Blended, precise semantic program embeddings", "topic_distr": {"0": 0.0015992306871339679, "1": 0.0013056101743131876, "2": 0.00110362539999187, "3": 0.15419064462184906, "4": 0.0008430596790276468, "5": 0.0007540533551946282, "6": 0.0006820474518463016, "7": 0.0006225948454812169, "8": 0.0005726760136894882, "9": 0.060716256499290466, "10": 0.000493534083943814, "11": 0.00046163590741343796, "12": 0.27351024746894836, "13": 0.10863132774829865, "14": 0.0003866630722768605, "15": 0.00036680581979453564, "16": 0.00034888851223513484, "17": 0.39278900623321533, "18": 0.0003178377519361675, "19": 0.0003042966709472239}}, {"key": "wang2020cocogum", "year": "2020", "title": "CoCoGUM: Contextual Code Summarization with Multi-Relational GNN on UMLs", "topic_distr": {"0": 0.0015022115549072623, "1": 0.001227058470249176, "2": 0.0010371378157287836, "3": 0.0008982995059341192, "4": 0.0007922651711851358, "5": 0.000708622916135937, "6": 0.0006409554625861347, "7": 0.0005850847228430212, "8": 0.0005381734226830304, "9": 0.0004982262616977096, "10": 0.00046379963168874383, "11": 0.0004338232392910868, "12": 0.22490745782852173, "13": 0.0003841643047053367, "14": 0.0003633673768490553, "15": 0.00034470646642148495, "16": 0.5073975324630737, "17": 0.00031259917886927724, "18": 0.0002986886538565159, "19": 0.25666579604148865}}, {"key": "wang2020detecting", "year": "2020", "title": "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree", "topic_distr": {"0": 0.0010409337701275945, "1": 0.09524805098772049, "2": 0.0007173665799200535, "3": 0.3496934473514557, "4": 0.0005480009131133556, "5": 0.033094361424446106, "6": 0.00044334109406918287, "7": 0.0004046960093546659, "8": 0.15279148519039154, "9": 0.0003446170303504914, "10": 0.00032080456730909646, "11": 0.00030007027089595795, "12": 0.36344993114471436, "13": 0.00026572178467176855, "14": 0.00025133677991107106, "15": 0.00023842927475925535, "16": 0.00022678276582155377, "17": 0.00021622104395646602, "18": 0.00020659930305555463, "19": 0.00019779738795477897}}, {"key": "wang2020learning", "year": "2020", "title": "Learning Semantic Program Embeddings with Graph Interval Neural Network", "topic_distr": {"0": 0.0011659828014671803, "1": 0.0009517634171061218, "2": 0.0008045464055612683, "3": 0.11088888347148895, "4": 0.03538781404495239, "5": 0.21505488455295563, "6": 0.0004971984890289605, "7": 0.0004538587527349591, "8": 0.0004174689238425344, "9": 0.00038648134795948863, "10": 0.0003597761387936771, "11": 0.00033652299316599965, "12": 0.3759152889251709, "13": 0.21130353212356567, "14": 0.0002818693465087563, "15": 0.00026739382883533835, "16": 0.04483070224523544, "17": 0.0002424877166049555, "18": 0.00023169713676907122, "19": 0.0002218259614892304}}, {"key": "wang2020learning2", "year": "2020", "title": "Learning to Represent Programs with Heterogeneous Graphs", "topic_distr": {"0": 0.0011036075884476304, "1": 0.0009011361398734152, "2": 0.0007617829251103103, "3": 0.000659812823869288, "4": 0.026630552485585213, "5": 0.0005204914486967027, "6": 0.00047078891657292843, "7": 0.00042975126416422427, "8": 0.17327603697776794, "9": 0.11523884534835815, "10": 0.00034066601074300706, "11": 0.00031864800257608294, "12": 0.677645742893219, "13": 0.0002821729576680809, "14": 0.0002668973756954074, "15": 0.0002531907521188259, "16": 0.00024082318122964352, "17": 0.0002296075690537691, "18": 0.00021939014550298452, "19": 0.00021004329028073698}}, {"key": "wang2020modular", "year": "2020", "title": "Modular Tree Network for Source Code Representation Learning", "topic_distr": {"0": 0.0017085325671359897, "1": 0.001395008061081171, "2": 0.001179225044324994, "3": 0.14465634524822235, "4": 0.0009007921325974166, "5": 0.0008056916412897408, "6": 0.0007287548505701125, "7": 0.0006652309093624353, "8": 0.0006118935416452587, "9": 0.022904392331838608, "10": 0.0005273318965919316, "11": 0.0004932492738589644, "12": 0.8207888007164001, "13": 0.0004367879591882229, "14": 0.0004131422028876841, "15": 0.00039192510303109884, "16": 0.0003727808070834726, "17": 0.0003554196737241, "18": 0.00033960366272367537, "19": 0.0003251352463848889}}, {"key": "wang2020trans", "year": "2020", "title": "TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search", "topic_distr": {"0": 0.0023085682187229395, "1": 0.35462602972984314, "2": 0.1600896418094635, "3": 0.0013807113282382488, "4": 0.001217738026753068, "5": 0.0010891759302467108, "6": 0.0009851688519120216, "7": 0.000899293867405504, "8": 0.0008271896513178945, "9": 0.0007657895912416279, "10": 0.000712874811142683, "11": 0.0006668001296930015, "12": 0.0006263195537030697, "13": 0.0005904727731831372, "14": 0.0005585072212852538, "15": 0.0005298248142935336, "16": 0.0005039445823058486, "17": 0.0004804748750757426, "18": 0.0004590939497575164, "19": 0.47068238258361816}}, {"key": "wang2021codet5", "year": "2021", "title": "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation", "topic_distr": {"0": 0.0016212622867897153, "1": 0.0013223914429545403, "2": 0.001117988838814199, "3": 0.1929846704006195, "4": 0.41097164154052734, "5": 0.0007638526149094105, "6": 0.000690910906996578, "7": 0.0006306857103481889, "8": 0.0005801181541755795, "9": 0.0005370575236156583, "10": 0.09686244279146194, "11": 0.00046763502177782357, "12": 0.22118327021598816, "13": 0.0004141057434026152, "14": 0.00039168790681287646, "15": 0.00037157259066589177, "16": 0.0003534224524628371, "17": 0.0003369628684595227, "18": 0.0003219681675545871, "19": 0.06807634979486465}}, {"key": "wang2021syncobert", "year": "2021", "title": "SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation", "topic_distr": {"0": 0.0018900723662227392, "1": 0.05528013035655022, "2": 0.0013042994542047381, "3": 0.2990160584449768, "4": 0.15366815030574799, "5": 0.0008911669137887657, "6": 0.0008060679538175464, "7": 0.0007358047878369689, "8": 0.043661460280418396, "9": 0.0006265711854211986, "10": 0.17777469754219055, "11": 0.000545577728189528, "12": 0.24164694547653198, "13": 0.00048312649596482515, "14": 0.019695648923516273, "15": 0.00043350414489395916, "16": 0.0004123288672417402, "17": 0.00039312586886808276, "18": 0.0003756319638341665, "19": 0.0003596286114770919}}, {"key": "wang2023codet5", "year": "2023", "title": "CodeT5+: Open Code Large Language Models for Code Understanding and Generation", "topic_distr": {"0": 0.0014180367579683661, "1": 0.0011576757533475757, "2": 0.0009781940607354045, "3": 0.0008472554618492723, "4": 0.9893799424171448, "5": 0.0006683586980216205, "6": 0.0006045360350981355, "7": 0.0005518399411812425, "8": 0.0005075941444374621, "9": 0.00046991679118946195, "10": 0.00043744631693698466, "11": 0.0004091731971129775, "12": 0.00038433284498751163, "13": 0.0003623359079938382, "14": 0.0003427206538617611, "15": 0.00032512008328922093, "16": 0.00030923899612389505, "17": 0.0002948371402453631, "18": 0.00028171701706014574, "19": 0.00026971480110660195}}, {"key": "wang2023deepvd", "year": "2023", "title": "DeepVD: Toward Class-Separation Features for Neural Network Vulnerability Detection", "topic_distr": {"0": 0.069327212870121, "1": 0.0015431579668074846, "2": 0.0013043158687651157, "3": 0.27722716331481934, "4": 0.0009963500779122114, "5": 0.3289972245693207, "6": 0.029041241854429245, "7": 0.0007357998983934522, "8": 0.000676804396789521, "9": 0.10445307195186615, "10": 0.0005832723109051585, "11": 0.0005455741193145514, "12": 0.18165446817874908, "13": 0.0004831232945434749, "14": 0.0004569691664073616, "15": 0.0004335012927185744, "16": 0.0004123261314816773, "17": 0.0003931232786271721, "18": 0.0003756294900085777, "19": 0.0003596262540668249}}, {"key": "watson2021systematic", "year": "2021", "title": "A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research", "topic_distr": {"0": 0.9192996621131897, "1": 0.0015914214309304953, "2": 0.001345056458376348, "3": 0.06818534433841705, "4": 0.001027480699121952, "5": 0.0009190055425278842, "6": 0.0008312480640597641, "7": 0.0007587899453938007, "8": 0.0006979511235840619, "9": 0.0006461440934799612, "10": 0.0006014966638758779, "11": 0.0005626205820590258, "12": 0.0005284646176733077, "13": 0.0004982184618711472, "14": 0.00047124712727963924, "15": 0.0004470460116863251, "16": 0.0004252092621754855, "17": 0.0004054064047522843, "18": 0.00038736601709388196, "19": 0.00037086274824105203}}, {"key": "waunakh2019idbench", "year": "2021", "title": "IdBench: Evaluating Semantic Representations of Identifier Names in Source Code", "topic_distr": {"0": 0.14052671194076538, "1": 0.0010287846671417356, "2": 0.0008695614524185658, "3": 0.48451298475265503, "4": 0.0006642401567660272, "5": 0.16810673475265503, "6": 0.12104205787181854, "7": 0.0004905387759208679, "8": 0.0004512080049607903, "9": 0.00041771604446694255, "10": 0.000388852582545951, "11": 0.0003637201734818518, "12": 0.04021038860082626, "13": 0.03930572420358658, "14": 0.00030464952578768134, "15": 0.00028900409233756363, "16": 0.0002748871629592031, "17": 0.00026208514464087784, "18": 0.0002504224539734423, "19": 0.0002397535281488672}}, {"key": "wei2019code", "year": "2019", "title": "Code Generation as a Dual Task of Code Summarization", "topic_distr": {"0": 0.0919736847281456, "1": 0.21913544833660126, "2": 0.001655449508689344, "3": 0.001433821627870202, "4": 0.2720148265361786, "5": 0.0011310705449432135, "6": 0.0010230628540739417, "7": 0.0009338847012259066, "8": 0.0008590070065110922, "9": 0.0007952452288009226, "10": 0.0007402951596304774, "11": 0.0006924482295289636, "12": 0.0006504105986095965, "13": 0.0006131849950179458, "14": 0.0005799898644909263, "15": 0.0005502042477019131, "16": 0.0005233284900896251, "17": 0.0004989560693502426, "18": 0.00047675275709480047, "19": 0.4037189483642578}}, {"key": "wei2020lambdanet", "year": "2020", "title": "LambdaNet: Probabilistic Type Inference using Graph Neural Networks", "topic_distr": {"0": 0.0014017524663358927, "1": 0.0011444254778325558, "2": 0.0009672327432781458, "3": 0.0008377458434551954, "4": 0.0007388558005914092, "5": 0.0006608522962778807, "6": 0.0005977463442832232, "7": 0.0005456421058624983, "8": 0.0005018931697122753, "9": 0.6984289288520813, "10": 0.0287129245698452, "11": 0.0004045776731800288, "12": 0.262896329164505, "13": 0.00035826643579639494, "14": 0.00033887146855704486, "15": 0.00032146857120096684, "16": 0.0003057658614125103, "17": 0.0002915257355198264, "18": 0.0002785529650282115, "19": 0.0002666855580173433}}, {"key": "wei2023typet5", "year": "2023", "title": "TypeT5: Seq2seq Type Inference using Static Analysis", "topic_distr": {"0": 0.0018337625078856945, "1": 0.0014975563390180469, "2": 0.05996296554803848, "3": 0.0010964531684294343, "4": 0.06000334769487381, "5": 0.0008649382507428527, "6": 0.000782343908213079, "7": 0.0007141486858017743, "8": 0.0006568891694769263, "9": 0.8681660294532776, "10": 0.0005661092582158744, "11": 0.0005295203882269561, "12": 0.0004973739269189537, "13": 0.00046890717931091785, "14": 0.000443522643763572, "15": 0.0004207453166600317, "16": 0.00040019326843321323, "17": 0.00038155546644702554, "18": 0.0003645764372777194, "19": 0.00034904410131275654}}, {"key": "white2015toward", "year": "2015", "title": "Toward Deep Learning Software Repositories", "topic_distr": {"0": 0.19447965919971466, "1": 0.0011708070524036884, "2": 0.10219664126634598, "3": 0.2728235721588135, "4": 0.0007558473153039813, "5": 0.000676049676258117, "6": 0.12927009165287018, "7": 0.0005581899895332754, "8": 0.04111673682928085, "9": 0.0004753241373691708, "10": 0.25346362590789795, "11": 0.00041388155659660697, "12": 0.0003887553757522255, "13": 0.00036650532274506986, "14": 0.0003466643684078008, "15": 0.0003288612642791122, "16": 0.0003127974341623485, "17": 0.0002982298319693655, "18": 0.000284958747215569, "19": 0.00027281843358650804}}, {"key": "white2016deep", "year": "2016", "title": "Deep Learning Code Fragments for Code Clone Detection", "topic_distr": {"0": 0.0017086126608774066, "1": 0.0013949439162388444, "2": 0.0011792093282565475, "3": 0.9873197078704834, "4": 0.0009007921325974166, "5": 0.0008056929800659418, "6": 0.0007287558983080089, "7": 0.000665231782477349, "8": 0.0006118943565525115, "9": 0.0005664750933647156, "10": 0.0005273326532915235, "11": 0.0004932499723508954, "12": 0.00046330541954375803, "13": 0.0004367885703686625, "14": 0.00041314278496429324, "15": 0.00039192562690004706, "16": 0.00037278133095242083, "17": 0.0003554201393853873, "18": 0.0003396041283849627, "19": 0.0003251357120461762}}, {"key": "white2017sorting", "year": "2017", "title": "Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities", "topic_distr": {"0": 0.001919701462611556, "1": 0.0015667334664613008, "2": 0.0013243933208286762, "3": 0.21052402257919312, "4": 0.0010116668418049812, "5": 0.0009048591600731015, "6": 0.09312552213668823, "7": 0.0007471099379472435, "8": 0.09404970705509186, "9": 0.0006361980340443552, "10": 0.0005922378622926772, "11": 0.0005539602134376764, "12": 0.0005203299806453288, "13": 0.0004905494279228151, "14": 0.0004639932594727725, "15": 0.3506460189819336, "16": 0.0004186640144325793, "17": 0.23975782096385956, "18": 0.00038140331162139773, "19": 0.00036515409010462463}}, {"key": "wong2021leveraging", "year": "2021", "title": "Leveraging Language to Learn Program Abstractions and Search Heuristics", "topic_distr": {"0": 0.0022276227828115225, "1": 0.001818799297325313, "2": 0.001537245698273182, "3": 0.0013314223615452647, "4": 0.0011742659844458103, "5": 0.001050292863510549, "6": 0.0009499987936578691, "7": 0.0008671893738210201, "8": 0.04257063567638397, "9": 0.0007384511409327388, "10": 0.0006874254322610795, "11": 0.0006429955828934908, "12": 0.0006039601867087185, "13": 0.0005693931016139686, "14": 0.0005385687109082937, "15": 0.0005109102348797023, "16": 0.00048595393309369683, "17": 0.5165290832519531, "18": 0.4247419238090515, "19": 0.0004238435940351337}}, {"key": "wu2021prototransformer", "year": "2021", "title": "ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback", "topic_distr": {"0": 0.0020120725966989994, "1": 0.0016424612840637565, "2": 0.0013885045191273093, "3": 0.0012025984469801188, "4": 0.4971470534801483, "5": 0.0009486592025496066, "6": 0.12915882468223572, "7": 0.0007832740084268153, "8": 0.0007204720750451088, "9": 0.0006669933791272342, "10": 0.0006209053099155426, "11": 0.0005807747947983444, "12": 0.0005455167265608907, "13": 0.0005142946029081941, "14": 0.00048645297647453845, "15": 0.00046147097600623965, "16": 0.10242800414562225, "17": 0.0004184877616353333, "18": 0.25789037346839905, "19": 0.00038282948662526906}}, {"key": "xia2023universal", "year": "2023", "title": "Universal Fuzzing via Large Language Models", "topic_distr": {"0": 0.0012120738392695785, "1": 0.000988708809018135, "2": 0.0008357784245163202, "3": 0.0007238824036903679, "4": 0.5550864338874817, "5": 0.2515164613723755, "6": 0.18541088700294495, "7": 0.0004714806273113936, "8": 0.0004336779238656163, "9": 0.0004014871665276587, "10": 0.0003737450751941651, "11": 0.00034958909964188933, "12": 0.00032836603350006044, "13": 0.00030957229319028556, "14": 0.0002928134344983846, "15": 0.00027777586365118623, "16": 0.00026420739595778286, "17": 0.0002519027329981327, "18": 0.00024069318897090852, "19": 0.00023043874534778297}}, {"key": "xu2019commit", "year": "2019", "title": "Commit Message Generation for Source Code Changes", "topic_distr": {"0": 0.0021511283703148365, "1": 0.0017558708786964417, "2": 0.0014842160744592547, "3": 0.0012854996602982283, "4": 0.001133758225478232, "5": 0.0010140626691281796, "6": 0.0009172281133942306, "7": 0.000837275292724371, "8": 0.0007701436406932771, "9": 0.0007129779551178217, "10": 0.0006637123879045248, "11": 0.7600171566009521, "12": 0.22394075989723206, "13": 0.00054975162493065, "14": 0.0005199905135668814, "15": 0.0004932861775159836, "16": 0.00046919070882722735, "17": 0.00044733958202414215, "18": 0.0004274331731721759, "19": 0.0004092229064553976}}, {"key": "xu2019method", "year": "2019", "title": "Method name suggestion with hierarchical attention networks", "topic_distr": {"0": 0.002078763209283352, "1": 0.001697429339401424, "2": 0.20696014165878296, "3": 0.0012426689499989152, "4": 0.0010959769133478403, "5": 0.0009802707936614752, "6": 0.000886663212440908, "7": 0.0008093746728263795, "8": 0.0007444800576195121, "9": 0.0006892192759551108, "10": 0.19299283623695374, "11": 0.000600127677898854, "12": 0.419024258852005, "13": 0.16752350330352783, "14": 0.0005026627914048731, "15": 0.0004768483340740204, "16": 0.000453555810963735, "17": 0.00043243280379101634, "18": 0.0004131897585466504, "19": 0.00039558630669489503}}, {"key": "xu2020incorporating", "year": "2020", "title": "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation", "topic_distr": {"0": 0.001783844199962914, "1": 0.0014545123558491468, "2": 0.0012297527864575386, "3": 0.0010651228949427605, "4": 0.0009394018561579287, "5": 0.0008402249077335, "6": 0.0007599905366078019, "7": 0.0449681431055069, "8": 0.0006381203420460224, "9": 0.0005907543818466365, "10": 0.0005499342805705965, "11": 0.0005143907619640231, "12": 0.00048316281754523516, "13": 0.00045550946379080415, "14": 0.0004308501956984401, "15": 0.00040872368845157325, "16": 0.00038875883910804987, "17": 0.0003706535790115595, "18": 0.5324165225028992, "19": 0.4097115993499756}}, {"key": "xu2021capturing", "year": "2021", "title": "Capturing Structural Locality in Non-parametric Language Models", "topic_distr": {"0": 0.09681197255849838, "1": 0.7136210799217224, "2": 0.0017216394189745188, "3": 0.0014911949401721358, "4": 0.00131517113186419, "5": 0.0011763233924284577, "6": 0.1751575618982315, "7": 0.0009712481405586004, "8": 0.0008933746721595526, "9": 0.0008270618855021894, "10": 0.0007699133129790425, "11": 0.0007201521075330675, "12": 0.0006764326244592667, "13": 0.000637717661447823, "14": 0.0006031944649294019, "15": 0.0005722171626985073, "16": 0.0005442661349661648, "17": 0.0005189186194911599, "18": 0.0004958269419148564, "19": 0.0004747028579004109}}, {"key": "xu2022systematic", "year": "2022", "title": "A Systematic Evaluation of Large Language Models of Code", "topic_distr": {"0": 0.7288708686828613, "1": 0.0016424567438662052, "2": 0.0013884315267205238, "3": 0.0012025750475004315, "4": 0.2580687701702118, "5": 0.0009486501221545041, "6": 0.0008580617723055184, "7": 0.0007832663250155747, "8": 0.0007204650319181383, "9": 0.0006669868598692119, "10": 0.0006208991981111467, "11": 0.0005807690904475749, "12": 0.0005455113714560866, "13": 0.0005142895970493555, "14": 0.000486448232550174, "15": 0.00046146646491251886, "16": 0.0004389252862893045, "17": 0.0004184836579952389, "18": 0.0003998613392468542, "19": 0.00038282573223114014}}, {"key": "yadavally2023partial", "year": "2023", "title": "(Partial) Program Dependence Learning", "topic_distr": {"0": 0.19330951571464539, "1": 0.0015668018022552133, "2": 0.0013243700377643108, "3": 0.0011470835888758302, "4": 0.042926762253046036, "5": 0.521549642086029, "6": 0.0008184582693502307, "7": 0.0007471150020137429, "8": 0.0006872122758068144, "9": 0.0006362023996189237, "10": 0.0005922418786212802, "11": 0.0005539639387279749, "12": 0.0005203335313126445, "13": 0.0004905527457594872, "14": 0.0004639964026864618, "15": 0.00044016767060384154, "16": 0.00041866686660796404, "17": 0.00039916872628964484, "18": 0.23104260861873627, "19": 0.00036515656393021345}}, {"key": "yadavally2024learning", "year": "2024", "title": "A Learning-Based Approach to Static Program Slicing", "topic_distr": {"0": 0.0018094503320753574, "1": 0.0014759562909603119, "2": 0.001247604377567768, "3": 0.0010805883212015033, "4": 0.10760007798671722, "5": 0.4080372750759125, "6": 0.0007710207719355822, "7": 0.0007038125186227262, "8": 0.0006473817047663033, "9": 0.1954921931028366, "10": 0.000557915773242712, "11": 0.0005218564183451235, "12": 0.0004901752108708024, "13": 0.00046212051529437304, "14": 0.0004371033573988825, "15": 0.0004146557184867561, "16": 0.00039440111140720546, "17": 0.00037603307282552123, "18": 0.27713635563850403, "19": 0.00034399222931824625}}, {"key": "yadavally2024predictive", "year": "2024", "title": "Predictive Program Slicing via Execution Knowledge-Guided Dynamic Dependence Learning", "topic_distr": {"0": 0.0018084823386743665, "1": 0.0014758072793483734, "2": 0.001247559324838221, "3": 0.0010805722558870912, "4": 0.0009530210518278182, "5": 0.0008524060831405222, "6": 0.0007710081990808249, "7": 0.0007038010517135262, "8": 0.0006473712273873389, "9": 0.0005993185914121568, "10": 0.0005579066928476095, "11": 0.0005218479200266302, "12": 0.0004901672364212573, "13": 0.0004621130065061152, "14": 0.00043709625606425107, "15": 0.0004146489663980901, "16": 0.0003943946794606745, "17": 0.9858792424201965, "18": 0.00035929391742683947, "19": 0.00034398664138279855}}, {"key": "yadid2016extracting", "year": "2016", "title": "Extracting Code from Programming Tutorial Videos", "topic_distr": {"0": 0.002228211611509323, "1": 0.0018185583176091313, "2": 0.0015372136840596795, "3": 0.0013314501848071814, "4": 0.001174278906546533, "5": 0.47927433252334595, "6": 0.13980309665203094, "7": 0.0008671994437463582, "8": 0.0007976684719324112, "9": 0.0007384596974588931, "10": 0.2132594734430313, "11": 0.0006430030334740877, "12": 0.0006039671716280282, "13": 0.0005693996790796518, "14": 0.0005385749391280115, "15": 0.0005109161720611155, "16": 0.000485959550132975, "17": 0.0004633274511434138, "18": 0.15293104946613312, "19": 0.00042384848347865045}}, {"key": "yan2020are", "year": "2020", "title": "Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries", "topic_distr": {"0": 0.27658626437187195, "1": 0.4004627764225006, "2": 0.21937832236289978, "3": 0.0008876227075234056, "4": 0.0007828456000424922, "5": 0.0007001979392953217, "6": 0.000633334566373378, "7": 0.0005781281506642699, "8": 0.000531774596311152, "9": 0.0004923024098388851, "10": 0.0004582851252052933, "11": 0.0004286651383154094, "12": 0.0004026414535474032, "13": 0.07943571358919144, "14": 0.0003590469714254141, "15": 0.01667153835296631, "16": 0.00032397033646702766, "17": 0.0003088823868893087, "18": 0.0002951372880488634, "19": 0.00028256329824216664}}, {"key": "yang2017language", "year": "2017", "title": "A Language Model for Statements of Software Code", "topic_distr": {"0": 0.0017578331753611565, "1": 0.001434587175026536, "2": 0.48504796624183655, "3": 0.001050140243023634, "4": 0.0009261802188120782, "5": 0.0008283983333967626, "6": 0.14669421315193176, "7": 0.0006839788984507322, "8": 0.0006291383178904653, "9": 0.0005824390682391822, "10": 0.25558707118034363, "11": 0.0005071503692306578, "12": 0.00047636195085942745, "13": 0.05389722064137459, "14": 0.00042478565592318773, "15": 0.0004029705887660384, "16": 0.00038328676600940526, "17": 0.04800286889076233, "18": 0.00034917460288852453, "19": 0.0003342984418850392}}, {"key": "yang2020survey", "year": "2020", "title": "A Survey on Deep Learning for Software Engineering", "topic_distr": {"0": 0.5919224619865417, "1": 0.2523423433303833, "2": 0.0010760443983599544, "3": 0.0009319893433712423, "4": 0.0008219736046157777, "5": 0.000735194596927613, "6": 0.000664989638607949, "7": 0.000607023888733238, "8": 0.0005583534948527813, "9": 0.0005169084179215133, "10": 0.0004811909166164696, "11": 0.0004500904760789126, "12": 0.00042276608292013407, "13": 0.0003985694784205407, "14": 0.0003769927134271711, "15": 0.0003576320596039295, "16": 0.0003401628928259015, "17": 0.14638878405094147, "18": 0.0003098886809311807, "19": 0.0002966862521134317}}, {"key": "yao2018staqc", "year": "2018", "title": "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow", "topic_distr": {"0": 0.001949424622580409, "1": 0.1326846480369568, "2": 0.0526033416390419, "3": 0.0011649972293525934, "4": 0.0010274848900735378, "5": 0.0009190083364956081, "6": 0.0008312505669891834, "7": 0.0007587922737002373, "8": 0.0006979532772675157, "9": 0.0006461460725404322, "10": 0.0006014984683133662, "11": 0.0005626222700811923, "12": 0.0005284662474878132, "13": 0.0797509029507637, "14": 0.0004712485824711621, "15": 0.00044704737956635654, "16": 0.0004252105427440256, "17": 0.00040540765621699393, "18": 0.00038736718124710023, "19": 0.7231371402740479}}, {"key": "yao2019coacor", "year": "2019", "title": "CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning", "topic_distr": {"0": 0.12691934406757355, "1": 0.19474069774150848, "2": 0.0011633605463430285, "3": 0.0010075921891257167, "4": 0.0008886526338756084, "5": 0.0007948343409225345, "6": 0.0007189341704361141, "7": 0.0006562662310898304, "8": 0.0006036476115696132, "9": 0.0005588405183516443, "10": 0.00052022555610165, "11": 0.0004866022209171206, "12": 0.00045706125092692673, "13": 0.00043090179678983986, "14": 0.00040757469832897186, "15": 0.00038664351450279355, "16": 0.0003677571949083358, "17": 0.00035063002724200487, "18": 0.0003350271435920149, "19": 0.6682053804397583}}, {"key": "yasunaga2020graph", "year": "2020", "title": "Graph-based, Self-Supervised Program Repair from Diagnostic Feedback", "topic_distr": {"0": 0.18378308415412903, "1": 0.001212508650496602, "2": 0.0010248133912682533, "3": 0.000887624453753233, "4": 0.0007828536326996982, "5": 0.0007002042257227004, "6": 0.0006333405035547912, "7": 0.0005781335639767349, "8": 0.0005317795439623296, "9": 0.0004923070082440972, "10": 0.00045828940346837044, "11": 0.04616251215338707, "12": 0.1780768483877182, "13": 0.00037960021290928125, "14": 0.07547497004270554, "15": 0.34291383624076843, "16": 0.00032397336326539516, "17": 0.020398804917931557, "18": 0.14490193128585815, "19": 0.0002825659466907382}}, {"key": "ye2020leveraging", "year": "2020", "title": "Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning", "topic_distr": {"0": 0.001579808071255684, "1": 0.1422143578529358, "2": 0.001089713885448873, "3": 0.0009438203414902091, "4": 0.14527226984500885, "5": 0.0007445304654538631, "6": 0.0006734341150149703, "7": 0.0006147322710603476, "8": 0.0005654438282363117, "9": 0.018210526555776596, "10": 0.0004873013822361827, "11": 0.0004558060027193278, "12": 0.0004281346336938441, "13": 0.0004036307509522885, "14": 0.10822560638189316, "15": 0.0003621735086198896, "16": 0.0003444824833422899, "17": 0.0003284392296336591, "18": 0.00031382383895106614, "19": 0.5767419934272766}}, {"key": "ye2020misim", "year": "2020", "title": "MISIM: An End-to-End Neural Code Similarity System", "topic_distr": {"0": 0.002356997923925519, "1": 0.301606684923172, "2": 0.0016242882702499628, "3": 0.2863435745239258, "4": 0.0012407383183017373, "5": 0.14175696671009064, "6": 0.0010037763277068734, "7": 0.0009162793285213411, "8": 0.0008428132277913392, "9": 0.0007802534382790327, "10": 0.0007263392908498645, "11": 0.0006793943466618657, "12": 0.0006381492130458355, "13": 0.0006016253610141575, "14": 0.0005690560210496187, "15": 0.0005398319335654378, "16": 0.0005134628154337406, "17": 0.0004895498277619481, "18": 0.2563224136829376, "19": 0.00044783655903302133}}, {"key": "ye2021neural", "year": "2021", "title": "Neural Program Repair with Execution-based Backpropagation", "topic_distr": {"0": 0.0020447976421564817, "1": 0.0016697903629392385, "2": 0.001411179662682116, "3": 0.0012222728691995144, "4": 0.0010779952863231301, "5": 0.0009641871438361704, "6": 0.000872115371748805, "7": 0.0007960949442349374, "8": 0.0007322650635614991, "9": 0.0006779109826311469, "10": 0.0006310684839263558, "11": 0.00059028115356341, "12": 0.0005544459563679993, "13": 0.0005227127694524825, "14": 0.3842763900756836, "15": 0.5692929029464722, "16": 0.0004461141361389309, "17": 0.03142198547720909, "18": 0.00040641039959155023, "19": 0.0003890957741532475}}, {"key": "ye2022selfapr", "year": "2022", "title": "SelfAPR: Self-supervised Program Repair with Test Execution Diagnostics", "topic_distr": {"0": 0.0018623180221766233, "1": 0.0015201050555333495, "2": 0.053318604826927185, "3": 0.03606921434402466, "4": 0.0009814770892262459, "5": 0.15598885715007782, "6": 0.0007940290379337966, "7": 0.0007248152396641672, "8": 0.0006667004781775177, "9": 0.0006172130815684795, "10": 0.0005745647358708084, "11": 0.0005374293541535735, "12": 0.0005048027378506958, "13": 0.0004759108123835176, "14": 0.0004501471121329814, "15": 0.7433961033821106, "16": 0.0004061705549247563, "17": 0.0003872544039040804, "18": 0.00037002176395617425, "19": 0.0003542574413586408}}, {"key": "yefet2019adversarial", "year": "2019", "title": "Adversarial Examples for Models of Code", "topic_distr": {"0": 0.002115541836246848, "1": 0.0017261960310861468, "2": 0.0014590518549084663, "3": 0.0012637133477255702, "4": 0.28569167852401733, "5": 0.59532630443573, "6": 0.0009016862022690475, "7": 0.0008230881066992879, "8": 0.000757093948777765, "9": 0.0007008969550952315, "10": 0.0006524661439470947, "11": 0.022210758179426193, "12": 0.0005732455756515265, "13": 0.08307871967554092, "14": 0.0005111795617267489, "15": 0.00048492770292796195, "16": 0.00046124053187668324, "17": 0.0004397596640046686, "18": 0.0004201905394438654, "19": 0.00040228883153758943}}, {"key": "yin2017syntactic", "year": "2017", "title": "A Syntactic Neural Model for General-Purpose Code Generation", "topic_distr": {"0": 0.0023096934892237186, "1": 0.0018863416044041514, "2": 0.001594140543602407, "3": 0.0013807554496452212, "4": 0.2756628096103668, "5": 0.0010892023565247655, "6": 0.0009851927170529962, "7": 0.0008993155206553638, "8": 0.0008272095583379269, "9": 0.0007658080430701375, "10": 0.0007128919824026525, "11": 0.0006668161950074136, "12": 0.35226970911026, "13": 0.0005904869758524001, "14": 0.12384914606809616, "15": 0.0005298375617712736, "16": 0.0005039566894993186, "17": 0.0004804864292964339, "18": 0.00045910500921308994, "19": 0.23253709077835083}}, {"key": "yin2018mining", "year": "2018", "title": "Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow", "topic_distr": {"0": 0.0016862752381712198, "1": 0.0013763702008873224, "2": 0.36613109707832336, "3": 0.0010075729805976152, "4": 0.27415576577186584, "5": 0.0007948195561766624, "6": 0.08322654664516449, "7": 0.0006562541821040213, "8": 0.0006036365521140397, "9": 0.0005588302738033235, "10": 0.0005202160100452602, "11": 0.00048659328604117036, "12": 0.09915070235729218, "13": 0.00043089388054795563, "14": 0.0004075672186445445, "15": 0.0003866364131681621, "16": 0.00036775044281966984, "17": 0.00035062359529547393, "18": 0.00033502100268378854, "19": 0.16736678779125214}}, {"key": "yin2019learning", "year": "2019", "title": "Learning to Represent Edits", "topic_distr": {"0": 0.0035642196889966726, "1": 0.002909874776378274, "2": 0.0024594333954155445, "3": 0.0021302413661032915, "4": 0.0018787819426506758, "5": 0.001680430956184864, "6": 0.001519964076578617, "7": 0.001387472148053348, "8": 0.402413010597229, "9": 0.0011814954923465848, "10": 0.0010998562211170793, "11": 0.25703537464141846, "12": 0.31524449586868286, "13": 0.0009110086830332875, "14": 0.0008616907289251685, "15": 0.0008174381800927222, "16": 0.0007775089470669627, "17": 0.0007412988343276083, "18": 0.0007083113305270672, "19": 0.0006781346164643764}}, {"key": "yin2022natural", "year": "2022", "title": "Natural Language to Code Generation in Interactive Data Science Notebooks", "topic_distr": {"0": 0.0023128469474613667, "1": 0.0018859518459066749, "2": 0.0015941780293360353, "3": 0.0013807560317218304, "4": 0.0012177738826721907, "5": 0.001089208759367466, "6": 0.000985198188573122, "7": 0.0008993205265142024, "8": 0.000827214156743139, "9": 0.0007658122922293842, "10": 0.0007128959405235946, "11": 0.0006668199202977121, "12": 0.000626338180154562, "13": 0.0005904902936890721, "14": 0.0005585238104686141, "15": 0.0005298405303619802, "16": 0.0005039595416747034, "17": 0.00048048910684883595, "18": 0.1992686539888382, "19": 0.7831037640571594}}, {"key": "yonai2019mercem", "year": "2019", "title": "Mercem: Method Name Recommendation Based on Call Graph Embedding", "topic_distr": {"0": 0.0021513875108212233, "1": 0.042011093348264694, "2": 0.12970907986164093, "3": 0.0012855399399995804, "4": 0.001133785001002252, "5": 0.0010140863014385104, "6": 0.0009172495338134468, "7": 0.0008372948504984379, "8": 0.11259926855564117, "9": 0.0007129946025088429, "10": 0.0006637278711423278, "11": 0.0006208296399563551, "12": 0.1164906695485115, "13": 0.5870864987373352, "14": 0.0005200026789680123, "15": 0.0004932977026328444, "16": 0.00046920168097130954, "17": 0.0004473500302992761, "18": 0.00042744315578602254, "19": 0.0004092324525117874}}, {"key": "yuan2017abridging", "year": "2017", "title": "Abridging Source Code", "topic_distr": {"0": 0.002547137439250946, "1": 0.3397003412246704, "2": 0.0017568833427503705, "3": 0.0015216957544907928, "4": 0.001342068426311016, "5": 0.0012003808515146375, "6": 0.001085754600353539, "7": 0.000991111621260643, "8": 0.23399227857589722, "9": 0.03695472702383995, "10": 0.0007856592419557273, "11": 0.0007348803337663412, "12": 0.0006902667228132486, "13": 0.0006507599609903991, "14": 0.0006155307637527585, "15": 0.0005839198711328208, "16": 0.0005553972441703081, "17": 0.000529531273059547, "18": 0.2998037338256836, "19": 0.07395792752504349}}, {"key": "zaremba2014learning", "year": "2014", "title": "Learning to Execute", "topic_distr": {"0": 0.09126228839159012, "1": 0.5185381770133972, "2": 0.09578394889831543, "3": 0.0015216704923659563, "4": 0.001342030125670135, "5": 0.0012003473239019513, "6": 0.0010857240995392203, "7": 0.0009910839144140482, "8": 0.0009116200380958617, "9": 0.0008439529919996858, "10": 0.000785637239459902, "11": 0.0007348597282543778, "12": 0.0006902473978698254, "13": 0.0006507417419925332, "14": 0.0006155134760774672, "15": 0.0792316198348999, "16": 0.0005553817027248442, "17": 0.000529516430106014, "18": 0.20224124193191528, "19": 0.00048439769307151437}}, {"key": "zeng2022extensive", "year": "2022", "title": "An Extensive Study on Pre-trained Models for Program Understanding and Generation", "topic_distr": {"0": 0.5122760534286499, "1": 0.0009703888208605349, "2": 0.000819886801764369, "3": 0.000710093998350203, "4": 0.29984790086746216, "5": 0.000560155778657645, "6": 0.0005066656158305705, "7": 0.00046250064042396843, "8": 0.00042541793663986027, "9": 0.00039384030969813466, "10": 0.00036662659840658307, "11": 0.0003429307253099978, "12": 0.000322111853165552, "13": 0.00030367608997039497, "14": 0.000287236412987113, "15": 0.00027248525293543935, "16": 0.0002591752272564918, "17": 0.00024710490833967924, "18": 0.18039973080158234, "19": 0.00022604972764384001}}, {"key": "zhang2019learning", "year": "2019", "title": "Learning Uniform Semantic Features for Natural Language and Programming Language Globally, Locally and Sequentially", "topic_distr": {"0": 0.0017319320468232036, "1": 0.001414298894815147, "2": 0.0011955862864851952, "3": 0.48037129640579224, "4": 0.0009133020066656172, "5": 0.0008168801432475448, "6": 0.13919250667095184, "7": 0.0006744687561877072, "8": 0.0006203907541930676, "9": 0.0005743408109992743, "10": 0.0005346548277884722, "11": 0.0005000989185646176, "12": 0.0004697385593317449, "13": 0.0004428535175975412, "14": 0.00041887941188178957, "15": 0.00039736766484566033, "16": 0.09289280325174332, "17": 0.00036035527591593564, "18": 0.0003443196474108845, "19": 0.2761339247226715}}, {"key": "zhang2019novel", "year": "2019", "title": "A Novel Neural Source Code Representation based on Abstract Syntax Tree", "topic_distr": {"0": 0.0012728316942229867, "1": 0.0010392427211627364, "2": 0.0008783949306234717, "3": 0.20618729293346405, "4": 0.0006710005691275001, "5": 0.0006001602741889656, "6": 0.24829638004302979, "7": 0.11857518553733826, "8": 0.000455799832707271, "9": 0.0004219670663587749, "10": 0.0003928098303731531, "11": 0.0003674216568470001, "12": 0.41887885332107544, "13": 0.0003253635950386524, "14": 0.0003077498695347458, "15": 0.00029194523813202977, "16": 0.00027768462314270437, "17": 0.0002647523069754243, "18": 0.00025297095999121666, "19": 0.00024219343322329223}}, {"key": "zhang2020generating", "year": "2020", "title": "Generating Adversarial Examples for Holding Robustness of Source Code Processing Models", "topic_distr": {"0": 0.4408419132232666, "1": 0.0021215020678937435, "2": 0.2757306694984436, "3": 0.001553350011818111, "4": 0.0013699878472834826, "5": 0.001225351938046515, "6": 0.2236587405204773, "7": 0.0010117292404174805, "8": 0.0009306101128458977, "9": 0.0008615333936177194, "10": 0.0008020029054023325, "11": 0.0007501677027903497, "12": 0.0007046259706839919, "13": 0.04509497433900833, "14": 0.0006283352850005031, "15": 0.0005960668786428869, "16": 0.0005669508827850223, "17": 0.000540546840056777, "18": 0.0005164927570149302, "19": 0.0004944882239215076}}, {"key": "zhang2021bag", "year": "2021", "title": "Bag-of-Words Baselines for Semantic Code Search", "topic_distr": {"0": 0.002496530767530203, "1": 0.16186168789863586, "2": 0.48848724365234375, "3": 0.001491196802817285, "4": 0.0013151677558198571, "5": 0.0011763193178921938, "6": 0.0010639907559379935, "7": 0.0009712449391372502, "8": 0.0008933717617765069, "9": 0.0008270592079497874, "10": 0.0007699108100496233, "11": 0.00072014972101897, "12": 0.000676430412568152, "13": 0.0006377155659720302, "14": 0.0006031924858689308, "15": 0.0005722152418456972, "16": 0.0005442643887363374, "17": 0.0005189168732613325, "18": 0.0004958253121003509, "19": 0.3338775634765625}}, {"key": "zhang2021disentangled.md", "year": "2021", "title": "Disentangled Code Representation Learning for Multiple Programming Languages", "topic_distr": {"0": 0.1365726888179779, "1": 0.0015669624553993344, "2": 0.16845093667507172, "3": 0.40303102135658264, "4": 0.09677258133888245, "5": 0.000904883083421737, "6": 0.0008184743346646428, "7": 0.0007471296703442931, "8": 0.0006872257799841464, "9": 0.0006362148560583591, "10": 0.000592253461945802, "11": 0.0005539748235605657, "12": 0.0005203437758609653, "13": 0.0004905623500235379, "14": 0.040778107941150665, "15": 0.00044017628533765674, "16": 0.00041867507388815284, "17": 0.0003991765552200377, "18": 0.00038141338154673576, "19": 0.1452372521162033}}, {"key": "zhang2022coditt5", "year": "2022", "title": "CoditT5: Pretraining for Source Code and Natural Language Editing", "topic_distr": {"0": 0.002714370144531131, "1": 0.0022139183711260557, "2": 0.0018714459147304296, "3": 0.0016208671731874347, "4": 0.5694743394851685, "5": 0.0012786209117621183, "6": 0.0011565230088308454, "7": 0.0010557114146649837, "8": 0.0009710658341646194, "9": 0.0008989862399175763, "10": 0.2249990701675415, "11": 0.0007827791851013899, "12": 0.0007352576940320432, "13": 0.0006931759417057037, "14": 0.0006556505104526877, "15": 0.18666765093803406, "16": 0.000591597578022629, "17": 0.0005640456802211702, "18": 0.000538945896551013, "19": 0.0005159847787581384}}, {"key": "zhang2023repocoder", "year": "2023", "title": "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation", "topic_distr": {"0": 0.0017097573727369308, "1": 0.0013949827989563346, "2": 0.05811706930398941, "3": 0.0010213805362582207, "4": 0.06785272806882858, "5": 0.0008057159138843417, "6": 0.0007287765620276332, "7": 0.02242405340075493, "8": 0.0006119117024354637, "9": 0.16421212255954742, "10": 0.3982958495616913, "11": 0.0004932639421895146, "12": 0.000463318545371294, "13": 0.0004368009394966066, "14": 0.00041315448470413685, "15": 0.00039193674456328154, "16": 0.00037279186653904617, "17": 0.00035543020931072533, "18": 0.0003396137326490134, "19": 0.2795593738555908}}, {"key": "zhao2018neural", "year": "2018", "title": "Neural-Augumented Static Analysis of Android Communication", "topic_distr": {"0": 0.2739708423614502, "1": 0.15508632361888885, "2": 0.0017933663912117481, "3": 0.001553308335132897, "4": 0.001369963982142508, "5": 0.20446664094924927, "6": 0.0011083224089816213, "7": 0.0010117122437804937, "8": 0.000930594454985112, "9": 0.3524450361728668, "10": 0.0008019894594326615, "11": 0.0007501550717279315, "12": 0.0007046141545288265, "13": 0.0006642862572334707, "14": 0.0006283247494138777, "15": 0.0005960568669252098, "16": 0.0005669413949362934, "17": 0.0005405378178693354, "18": 0.0005164840840734541, "19": 0.0004944799584336579}}, {"key": "zhao2019neural", "year": "2019", "title": "Neural Networks for Modeling Source Code Edits", "topic_distr": {"0": 0.3397161662578583, "1": 0.17913290858268738, "2": 0.001195632852613926, "3": 0.0010355598060414195, "4": 0.0009133248822763562, "5": 0.0008169015054591, "6": 0.0007388941594399512, "7": 0.000674486334901303, "8": 0.11202581971883774, "9": 0.10983293503522873, "10": 0.0005346687394194305, "11": 0.0005001119570806623, "12": 0.0004697508120443672, "13": 0.00044286507181823254, "14": 0.2501603066921234, "15": 0.0003973780258093029, "16": 0.00037796737160533667, "17": 0.00036036467645317316, "18": 0.0003443286113906652, "19": 0.0003296589129604399}}, {"key": "zhong2018generating", "year": "2018", "title": "Generating Regular Expressions from Natural Language Specifications: Are We There Yet?", "topic_distr": {"0": 0.0021159581374377012, "1": 0.0017259714659303427, "2": 0.30567237734794617, "3": 0.0012637132313102484, "4": 0.0011145437601953745, "5": 0.0009968769736588001, "6": 0.0009016832336783409, "7": 0.0008230855455622077, "8": 0.0007570915622636676, "9": 0.0007008947432041168, "10": 0.0006524640484713018, "11": 0.0006102938787080348, "12": 0.0005732437130063772, "13": 0.0005404346738941967, "14": 0.0005111779319122434, "15": 0.0004849261895287782, "16": 0.0004612390766851604, "17": 0.00043975826702080667, "18": 0.00042018922977149487, "19": 0.6792341470718384}}, {"key": "zhong2020semantic", "year": "2020", "title": "Semantic Scaffolds for Pseudocode-to-Code Generation", "topic_distr": {"0": 0.0024449536576867104, "1": 0.001996838254854083, "2": 0.5892619490623474, "3": 0.0014619670109823346, "4": 0.0012893843231722713, "5": 0.0011532583739608526, "6": 0.0010431319242343307, "7": 0.0009522043401375413, "8": 0.0008758578333072364, "9": 0.0008108452893793583, "10": 0.0007548172143287957, "11": 0.0007060317439027131, "12": 0.22471271455287933, "13": 0.0006252136081457138, "14": 0.0005913673085160553, "15": 0.0005609974032267928, "16": 0.0005335944588296115, "17": 0.0005087439203634858, "18": 0.1692507266998291, "19": 0.00046539510367438197}}, {"key": "zhou2019devign", "year": "2020", "title": "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks", "topic_distr": {"0": 0.2250344306230545, "1": 0.0011843466199934483, "2": 0.001001001917757094, "3": 0.0008669961825944483, "4": 0.0007646517478860915, "5": 0.2527797520160675, "6": 0.0006186153623275459, "7": 0.0005646920180879533, "8": 0.1361583024263382, "9": 0.0004808609082829207, "10": 0.0004476342292036861, "11": 0.00041870263521559536, "12": 0.3558594286441803, "13": 0.00037077453453093767, "14": 0.0003507024375721812, "15": 0.021916717290878296, "16": 0.00031644103000871837, "17": 0.0003017037233803421, "18": 0.00028827806818298995, "19": 0.00027599630993790925}}, {"key": "zhou2021improving", "year": "2021", "title": "Improving Code Autocompletion with Transfer Learning", "topic_distr": {"0": 0.4827761650085449, "1": 0.002078510820865631, "2": 0.0017568521434441209, "3": 0.001521642436273396, "4": 0.2160215675830841, "5": 0.0012003519805148244, "6": 0.0010857281740754843, "7": 0.0009910876397043467, "8": 0.0009116234723478556, "9": 0.0008439561352133751, "10": 0.24821384251117706, "11": 0.0007348625222221017, "12": 0.0006902499590069056, "13": 0.0006507441867142916, "14": 0.0006155158043839037, "15": 0.000583905668463558, "16": 0.0005553837399929762, "17": 0.0005295184091664851, "18": 0.03775407746434212, "19": 0.0004843994975090027}}, {"key": "zhou2022codebertscore", "year": "2023", "title": "CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code", "topic_distr": {"0": 0.515889585018158, "1": 0.0015670802677050233, "2": 0.0013243986759334803, "3": 0.0011471111793071032, "4": 0.0010116903576999903, "5": 0.0009048812789842486, "6": 0.0008184727048501372, "7": 0.0007471281569451094, "8": 0.0006872243830002844, "9": 0.000636213575489819, "10": 0.0005922522977925837, "11": 0.16091066598892212, "12": 0.0005203427281230688, "13": 0.0004905614187009633, "14": 0.00046400458086282015, "15": 0.00044017541222274303, "16": 0.00041867425898090005, "17": 0.310683012008667, "18": 0.0003814126248471439, "19": 0.0003651629958767444}}, {"key": "zhou2022docoder", "year": "2022", "title": "DocCoder: Generating Code by Retrieving and Reading Docs", "topic_distr": {"0": 0.0017355451127514243, "1": 0.0014145876048132777, "2": 0.001195641583763063, "3": 0.0010355714475736022, "4": 0.0009133360581472516, "5": 0.0008169100037775934, "6": 0.0007389019592665136, "7": 0.0006744934362359345, "8": 0.0006204133969731629, "9": 0.000574361823964864, "10": 0.0005346743855625391, "11": 0.0005001171957701445, "12": 0.0004697557305917144, "13": 0.00044286969932727516, "14": 0.0004188947204966098, "15": 0.00039738218765705824, "16": 0.0003779713297262788, "17": 0.5120353102684021, "18": 0.0003443322202656418, "19": 0.47475895285606384}}, {"key": "zhu2020ocor", "year": "2020", "title": "OCoR: An Overlapping-Aware Code Retriever", "topic_distr": {"0": 0.0015032500959932804, "1": 0.4604801535606384, "2": 0.0010371492244303226, "3": 0.000898315163794905, "4": 0.0007922732620500028, "5": 0.0007086294936016202, "6": 0.0006409613415598869, "7": 0.0005850901361554861, "8": 0.000538178370334208, "9": 0.0004982308018952608, "10": 0.00046380390995182097, "11": 0.02302638441324234, "12": 0.00040749015170149505, "13": 0.20389574766159058, "14": 0.0003633707237895578, "15": 0.00034470963873900473, "16": 0.00032787167583592236, "17": 0.000312602031044662, "18": 0.0002986913896165788, "19": 0.30287712812423706}}, {"key": "zhu2921syntax", "year": "2021", "title": "A Syntax-Guided Edit Decoder for Neural Program Repair", "topic_distr": {"0": 0.0015796252992004156, "1": 0.001289147650822997, "2": 0.0010896919993683696, "3": 0.03401162847876549, "4": 0.0008323924266733229, "5": 0.0007445125374943018, "6": 0.0006734178168699145, "7": 0.0006147174281068146, "8": 0.000565430149435997, "9": 0.0005234598065726459, "10": 0.17169968783855438, "11": 0.00045579500147141516, "12": 0.00042812430183403194, "13": 0.0004036210011690855, "14": 0.2542150020599365, "15": 0.4390137791633606, "16": 0.0003444741596467793, "17": 0.0909012034535408, "18": 0.0003138162719551474, "19": 0.0003004465252161026}}, {"key": "ziegler2022productivity", "year": "2022", "title": "Productivity Assessment of Neural Code Completion", "topic_distr": {"0": 0.18682043254375458, "1": 0.24997524917125702, "2": 0.002152211032807827, "3": 0.0018639967311173677, "4": 0.0016439828323200345, "5": 0.0014704191125929356, "6": 0.00133000616915524, "7": 0.0012140724575147033, "8": 0.0011167296906933188, "9": 0.0010338377906009555, "10": 0.3206026256084442, "11": 0.09103864431381226, "12": 0.0008455493371002376, "13": 0.0007971551385708153, "14": 0.0007540007354691625, "15": 0.0007152786711230874, "16": 0.0006803395808674395, "17": 0.0006486548227258027, "18": 0.13470341265201569, "19": 0.0005933845532126725}}, {"key": "zlotchevski2022exploring", "year": "2022", "title": "Exploring and Evaluating Personalized Models for Code Generation", "topic_distr": {"0": 0.45824170112609863, "1": 0.0014549590414389968, "2": 0.10545346140861511, "3": 0.07665299624204636, "4": 0.23398517072200775, "5": 0.0008402339299209416, "6": 0.0007599986274726689, "7": 0.0006937511498108506, "8": 0.0006381270941346884, "9": 0.0005907606682740152, "10": 0.0005499401013366878, "11": 0.000514396233484149, "12": 0.0004831679107155651, "13": 0.0004555142659228295, "14": 0.0004308547649998218, "15": 0.00040872799581848085, "16": 0.00038876294274814427, "17": 0.0003706574789248407, "18": 0.0003541634068824351, "19": 0.11673269420862198}}, {"key": "zugner2021language", "year": "2021", "title": "Language-Agnostic Representation Learning of Source Code from Structure and Context", "topic_distr": {"0": 0.0024013854563236237, "1": 0.0019585685804486275, "2": 0.0016555003821849823, "3": 0.09941122680902481, "4": 0.0012645991519093513, "5": 0.001131089637055993, "6": 0.0010230799671262503, "7": 0.0009339003008790314, "8": 0.10045094043016434, "9": 0.0007952585583552718, "10": 0.31089282035827637, "11": 0.0006924598128534853, "12": 0.47369030117988586, "13": 0.0006131952395662665, "14": 0.0005799995851702988, "15": 0.0005502134445123374, "16": 0.000523337279446423, "17": 0.0004989643930457532, "18": 0.0004767607315443456, "19": 0.00045644890633411705}}]} \ No newline at end of file diff --git a/tsne-viz.html b/tsne-viz.html index a8d66a05..60621340 100644 --- a/tsne-viz.html +++ b/tsne-viz.html @@ -1,9 +1,104 @@ ---- -layout: default -title: A Map of Publications on Machine Learning for Source Code -description: A map/visualization of the ML4Code papers. ---- -

2D Map of Papers

+ + + + + + + + + + + + + + + + A Map of Publications on Machine Learning for Source Code · Machine Learning for Big Code and Naturalness + + + + + + + + + + + + + + + + + + + + + + + + + + Contribute to ML4Code + + + + + +

+

2D Map of Papers

Each dot represents one paper in this survey. Hover your mouse over each point to look at the details. Click on a point to go to the paper information page.

Search

@@ -114,3 +209,8 @@

2D Map of Papers

}); + +

+ + + diff --git a/tsne.json b/tsne.json new file mode 100644 index 00000000..c94de2a0 --- /dev/null +++ b/tsne.json @@ -0,0 +1 @@ +[{"key": "abdelaziz2020graph4code", "year": "2020", "title": "Graph4Code: A Machine Interpretable Knowledge Graph for Code", "abstract": "

Knowledge graphs have proven extremely useful in powering diverse applications in semantic search and natural language understanding. Graph4Code is a knowledge graph about program code that can similarly power diverse applications such as program search, code understanding, refactoring, bug detection, and code automation. The graph uses generic techniques to capture the semantics of Python code: the key nodes in the graph are classes, functions and methods in popular Python modules. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code), and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). We make extensive use of named graphs in RDF to make the knowledge graph extensible by the community. We describe a set of generic extraction techniques that we applied to over 1.3M Python files drawn from GitHub, over 2,300 Python modules, as well as 47M forum posts to generate a graph with over 2 billion triples. We also provide a number of initial use cases of the knowledge graph in code assistance, enforcing best practices, debugging and type inference. The graph and all its artifacts are available to the community for use.

\n", "tags": ["dataset"], "tsne_embedding": [13.78028678894043, -15.595023155212402]}, {"key": "agashe2019julce", "year": "2019", "title": "JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation", "abstract": "

Interactive programming with interleaved code snippet cells and natural language markdown is recently gaining popularity in the form of Jupyter notebooks, which accelerate prototyping and collaboration. To study code generation conditioned on a long context history, we present JuICe, a corpus of 1.5 million examples with a curated test set of 3.7K instances based on online programming assignments. Compared with existing contextual code generation datasets, JuICe provides refined human-curated data, open-domain code, and an order of magnitude more training data. Using JuICe, we train models for two tasks: (1) generation of the API call sequence in a code cell, and (2) full code cell generation, both conditioned on the NL-Code history up to a particular code cell. Experiments using current baseline code generation models show that both context and distant supervision aid in generation, and that the dataset is challenging for current systems.

\n", "tags": ["dataset", "bimodal"], "tsne_embedding": [3.0614442825317383, -0.9099875092506409]}, {"key": "aggarwal2015using", "year": "2015", "title": "Using Machine Translation for Converting Python 2 to Python 3 Code", "abstract": "

In this paper, we have tried to use Statistical machine translation in order to convert Python 2 code to Python 3 code. We use data from two projects and achieve a high BLEU score. We also investigate the cross-project training and testing to analyze the errors so as to ascertain differences with previous case. We have described a pilot study on modeling programming languages as natural language to build translation models on the lines of natural languages. This can be further worked on to translate between versions of a programming language or cross-programming-languages code translation.

\n", "tags": ["migration"], "tsne_embedding": [2.424260377883911, -23.68354606628418]}, {"key": "agrawal2023monitor", "year": "2023", "title": "Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context", "abstract": "

Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating.

\n\n

Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model.

\n\n

We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen.

\n", "tags": ["autocomplete", "benchmark", "code completion", "code generation", "compilation", "completion", "dataset", "evaluation", "language model", "large language models", "program analysis", "static analysis", "tool"], "tsne_embedding": [16.056480407714844, 9.943403244018555]}, {"key": "ahmad2020transformer", "year": "2020", "title": "A Transformer-based Approach for Source Code Summarization", "abstract": "

Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model that uses a self-attention mechanism and has shown to be effective in capturing long-range dependencies. In this work, we show that despite the approach is simple, it outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., the absolute encoding of source code tokens\u2019 position hinders, while relative encoding significantly improves the summarization performance. We have made our code publicly available to facilitate future research.

\n", "tags": ["summarization"], "tsne_embedding": [-14.828167915344238, -10.009238243103027]}, {"key": "ahmad2021unified", "year": "2021", "title": "Unified Pre-training for Program Understanding and Generation", "abstract": "

Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on language generation tasks, including code summarization, generation, translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection demonstrate PLBART\u2019s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.

\n", "tags": ["pretraining", "Transformer"], "tsne_embedding": [0.4936544895172119, -22.314661026000977]}, {"key": "ahmed2019learning", "year": "2019", "title": "Learning Lenient Parsing & Typing via Indirect Supervision", "abstract": "

Both professional coders and teachers frequently deal with imperfect (fragmentary, incomplete, ill-formed) code. Such fragments are common in StackOverflow; students also frequently produce ill-formed code, for which instructors, TAs (or students themselves) must find repairs. In either case, the developer experience could be greatly improved if such code could somehow be parsed & typed; this makes them more amenable to use within IDEs and allows early detection and repair of potential errors. We introduce a lenient parser, which can parse & type fragments, even ones with simple errors. Training a machine learner to leniently parse & type imperfect code requires a large training set of pairs of imperfect code and its repair (and/or type information); such training sets are limited by human effort and curation. In this paper, we present a novel indirectly supervised approach to train a lenient parser, without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on Github, and the massive, efficient learning capacity of Transformer-based NN architectures. Using GitHub data, we first create a large dataset of fragments of code and corresponding tree fragments and type annotations; we then randomly corrupt the input fragments (while requiring correct output) by seeding errors that mimic corruptions found in StackOverflow and student data. Using this data, we train high-capacity transformer models to overcome both fragmentation and corruption. With this novel approach, we can achieve reasonable performance on parsing & typing StackOverflow fragments; we also demonstrate that our approach achieves best-in-class performance on a large dataset of student errors.

\n", "tags": ["types"], "tsne_embedding": [17.066755294799805, -3.543370485305786]}, {"key": "ahmed2022learning", "year": "2022", "title": "Learning code summarization from a small and local dataset", "abstract": "

Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary, and other phenomena vary substantially with each project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting) and a maximalist hybrid approach, fine-tuning first on many projects in many languages and then training on the same-project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art, on many different projects in both Java and Python.

\n", "tags": ["Transformer", "summarization"], "tsne_embedding": [-10.464727401733398, -7.1744303703308105]}, {"key": "ahmed2024studying", "year": "2024", "title": "Studying LLM Performance on Closed- and Open-source Data", "abstract": "

Large Language models (LLMs) are finding wide use in software engineering practice. These models are extremely data-hungry, and are largely trained on open-source (OSS) code distributed with permissive licenses. In terms of actual use however, a great deal of software development still occurs in the for-profit/proprietary sphere, where the code under development is not, and never has been, in the public domain; thus, many developers, do their work, and use LLMs, in settings where the models may not be as familiar with the code under development. In such settings, do LLMs work as well as they do for OSS code? If not, what are the differences? When performance differs, what are the possible causes, and are there work-arounds? In this paper, we examine this issue using proprietary, closed-source software data from Microsoft, where most proprietary code is in C# and C++. We find that performance for C# changes little from OSS \u2013> proprietary code, but does significantly reduce for C++; we find that this difference is attributable to differences in identifiers. We also find that some performance degradation, in some cases, can be ameliorated efficiently by in-context learning.

\n", "tags": ["Transformers"], "tsne_embedding": [14.469216346740723, 8.71738052368164]}, {"key": "ahmed2033improving", "year": "2023", "title": "Improving Few-Shot Prompts with Relevant Static Analysis Products", "abstract": "

Large Language Models (LLM) are a new class of computation engines, \u201cprogrammed\u201d via prompt engineering. We are still learning how to best \u201cprogram\u201d these LLMs to help developers. We start with the intuition that developers tend to consciously and unconsciously have a collection of semantics facts in mind when working on coding tasks. Mostly these are shallow, simple facts arising from a quick read. For a function, examples of facts might include parameter and local variable names, return expressions, simple pre- and post-conditions, and basic control and data flow, etc.

\n\n

One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them inherently capable of doing this simple level of \u201ccode analysis\u201d and extracting such information, implicitly, while processing code: but are they, really? If they aren\u2019t, could explicitly adding this information help? Our goal here is to investigate this question, using the code summarization task and evaluate whether automatically augmenting an LLM\u2019s prompt with semantic facts explicitly, actually helps.

\n\n

Prior work shows that LLM performance on code summarization benefits from few-shot samples drawn either from the same-project or from examples found via information retrieval methods (such as BM25). While summarization performance has steadily increased since the early days, there is still room for improvement: LLM performance on code summarization still lags its performance on natural-language tasks like translation and text summarization.

\n\n

We find that adding semantic facts actually does help! This approach improves performance in several different settings suggested by prior work, including for two different Large Language Models. In most cases, improvement nears or exceeds 2 BLEU; for the PHP language in the challenging CodeSearchNet dataset, this augmentation actually yields performance surpassing 30 BLEU.

\n", "tags": ["summarization", "Transformer"], "tsne_embedding": [-0.6227796673774719, -1.4090102910995483]}, {"key": "alet2021largescale", "year": "2021", "title": "A large-scale benchmark for few-shot program induction and synthesis", "abstract": "

A landmark challenge for AI is to learn flexible, powerful representations from small numbers of examples. \nOn an important class of tasks, hypotheses in the form of programs provide extreme generalization capabilities from surprisingly few examples. However, whereas large natural few-shot learning image benchmarks have spurred progress in meta-learning for deep networks, there is no comparably big, natural program-synthesis dataset that can play a similar role. This is because, whereas images are relatively easy to label from internet meta-data or annotated by non-experts, generating meaningful input-output examples for program induction has proven hard to scale. In this work, we propose a new way of leveraging unit tests and natural inputs for small programs as meaningful input-output examples for each sub-program of the overall program. This allows us to create a large-scale naturalistic few-shot program-induction benchmark and propose new challenges in this domain. The evaluation of multiple program induction and synthesis algorithms points to shortcomings of current methods and suggests multiple avenues for future work.

\n", "tags": ["dataset", "synthesis"], "tsne_embedding": [4.9685258865356445, 7.901976585388184]}, {"key": "allal2022santacoder", "year": "2022", "title": "SantaCoder: don\u2019t reach for the stars!", "abstract": "

The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code.1 This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII)\nredaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java,\nJavaScript, and Python subsets of The Stack (Kocetkov et al., 2022) and\nevaluate the models on MultiPL-E (Cassano et al., 2022), a text2code\nbenchmark available in 18 programming languages. We find that more\naggressive filtering of near-duplicates can further boost performance and,\nsurprisingly, that selecting files from repositories with 5+ GitHub stars\ndeteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and\nCodeGen-Multi-2.7B) in both left-to-right generation and infilling on the\nJava, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL\nlicense at https://hf.co/bigcode

\n", "tags": ["Transformer"], "tsne_embedding": [0.330452561378479, 3.539247989654541]}, {"key": "allamanis2013mining", "year": "2013", "title": "Mining Source Code Repositories at Massive Scale Using Language Modeling ", "abstract": "

The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. The giga-token model is significantly better at the code suggestion task than previous models. More broadly, our approach provides a new \u201clens\u201d for analyzing software projects, enabling new complexity metrics based on statistical analysis of large corpora. We call these metrics data-driven complexity metrics. We propose new metrics that measure the complexity of a code module and the topical centrality of a module to a software project. In particular, it is possible to distinguish reusable utility classes from classes that are part of a program\u2019s core logic based solely on general information theoretic criteria.

\n", "tags": ["language model"], "tsne_embedding": [9.654191017150879, -11.278950691223145]}, {"key": "allamanis2014learning", "year": "2014", "title": "Learning Natural Coding Conventions", "abstract": "

Every programmer has a characteristic style, ranging from preferences\nabout identifier naming to preferences about object relationships and\ndesign patterns. Coding conventions define a consistent syntactic style,\nfostering readability and hence maintainability. When collaborating,\nprogrammers strive to obey a project\u2019s coding conventions. However,\none third of reviews of changes contain feedback about coding conventions,\nindicating that programmers do not always follow them and that project\nmembers care deeply about adherence. Unfortunately, programmers are\noften unaware of coding conventions because inferring them requires a\nglobal view, one that aggregates the many local decisions programmers\nmake and identifies emergent consensus on style. We present Naturalize,\na framework that learns the style of a codebase, and suggests revisions\nto improve stylistic consistency. Naturalize builds on recent work in\napplying statistical natural language processing to source code. We\napply Naturalize to suggest natural identifier names and formatting\nconventions. We present four tools focused on ensuring natural code\nduring development and release management, including code review.\nNaturalize achieves 94% accuracy in its top suggestions for identifier\nnames. We used Naturalize to generate 18 patches for 5 open source\nprojects: 14 were accepted.

\n", "tags": ["naming", "language model", "style"], "tsne_embedding": [-22.733673095703125, -12.862963676452637]}, {"key": "allamanis2014mining", "year": "2014", "title": "Mining Idioms from Source Code", "abstract": "

We present the first method for automatically mining code idioms from a corpus of previously written, idiomatic software projects. We take the view that a code idiom is a syntactic fragment that recurs across projects and has a single semantic purpose. Idioms may have metavariables, such as the body of a for loop. Modern IDEs commonly provide facilities for manually defining idioms and inserting them on demand, but this does not help programmers to write idiomatic code in languages or using libraries with which they are unfamiliar. We present Haggis, a system for mining code idioms that builds on recent advanced techniques from statistical natural language processing, namely, nonparametric Bayesian probabilistic tree substitution grammars. We apply Haggis to several of the most popular open source projects from GitHub. We present a wide range of evidence that the resulting idioms are semantically meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. Manual examination of the most common idioms indicate that they describe important program concepts, including object creation, exception handling, and resource management.

\n", "tags": ["pattern mining", "grammar", "grammar"], "tsne_embedding": [11.178251266479492, -13.368805885314941]}, {"key": "allamanis2015bimodal", "year": "2015", "title": "A Bimodal Modelling of Source Code and Natural Language", "abstract": "

We consider the problem of building probabilistic models that jointly \nmodel short natural language utterances and source code snippets. The\naim is to bring together recent work on statistical modelling of source\ncode and work on bimodal models of images and natural language. The\nresulting models are useful for a variety of tasks that involve natural\nlanguage and source code. We demonstrate their performance on two\nretrieval tasks: retrieving source code snippets given a natural language\nquery, and retrieving natural language descriptions given a source code\nquery (i.e., source code captioning). Experiments show there to be\npromise in this direction, and that modelling the structure of source\ncode improves performance.

\n", "tags": ["search", "grammar", "grammar", "bimodal"], "tsne_embedding": [-11.84818172454834, -11.528618812561035]}, {"key": "allamanis2015suggesting", "year": "2015", "title": "Suggesting Accurate Method and Class Names", "abstract": "

Descriptive names are a vital part of readable, and hence maintainable, code. Recent progress on automatically suggesting names for local variables tantalizes with the prospect of replicating that success with method and class names. However, suggesting names for methods and classes is much more difficult. This is because good method and class names need to be functionally descriptive, but suggesting such names requires that the model goes beyond local context. We introduce a neural probabilistic language model for source code that is specifically designed for the method naming problem. Our model learns which names are semantically similar by assigning them to locations, called embeddings, in a high-dimensional continuous space, in such a way that names with similar embeddings tend to be used in similar contexts. These embeddings seem to contain semantic information about tokens, even though they are learned only from statistical co-occurrences of tokens. Furthermore, we introduce a variant of our model\nthat is, to our knowledge, the first that can propose neologisms, names that have not appeared in the training corpus. We obtain state of the art results on the method, class, and even the simpler variable naming tasks. More broadly, the continuous embeddings that are learned by our model have the potential for wide application within software engineering.

\n\n", "tags": ["naming"], "tsne_embedding": [8.675830841064453, -7.6191558837890625]}, {"key": "allamanis2016convolutional", "year": "2016", "title": "A Convolutional Attention Network for Extreme Summarization of Source Code", "abstract": "

Attention mechanisms in neural networks have proved useful for problems in which\nthe input and output do not have fixed dimension. Often there exist features that\nare locally translation invariant and would be valuable for directing the model\u2019s attention,\nbut previous attentional architectures are not constructed to learn such features specifically.\nWe introduce an attentional neural network that employs convolution on the input tokens to detect\nlocal time-invariant and long-range topical attention features in a context-dependent way. We\napply this architecture to the problem of extreme summarization of source code snippets into short,\ndescriptive function name-like summaries. Using those features, the model sequentially generates a\nsummary by marginalizing over two attention mechanisms: one that predicts the next summary token based \nn the attention weights of the input tokens and another that is able to copy a code token as-is directly\ninto the summary. We demonstrate our convolutional attention neural network\u2019s performance on 10 popular Java\nprojects showing that it achieves better performance compared to previous attentional mechanisms.

\n", "tags": ["naming", "summarization"], "tsne_embedding": [-15.409921646118164, -8.32945442199707]}, {"key": "allamanis2017mining", "year": "2017", "title": "Mining Semantic Loop Idioms from Big Code", "abstract": "

During maintenance, developers spend a lot of time transforming existing code: refactoring, optimizing, and adding checks to make it more robust. Much of this work is the drudgery of identifying and replacing specific patterns, yet it resists automation, because of meaningful patterns are hard to automatically find. We present a technique for mining loop idioms, surprisingly probable semantic patterns that occur in loops, from big code to find meaningful patterns. First, we show that automatically identifiable patterns exist, in great numbers, with a large scale empirical study of loop over 25 MLOC. We find that loops in this corpus are simple and predictable: 90% of them have fewer than 15LOC and 90% have no nesting and very simple control structure. Encouraged by this result, we coil loops to abstract away syntactic diversity to define information rich loop idioms. We show that only 50 loop idioms cover 50% of the concrete loops. We show how loop idioms can help a tool developers identify and prioritize refactorings. We also show how our framework opens the door to data-driven tool and language design discovering opportunities to introduce new API calls and language constructs: loop idioms show that LINQ would benefit from an Enumerate operator, a result confirmed by the fact that precisely this feature is one of the most requested features on StackOverflow with 197 votes and 95k views.

\n", "tags": ["pattern mining", "grammar"], "tsne_embedding": [11.911873817443848, -13.301508903503418]}, {"key": "allamanis2017smartpaste", "year": "2017", "title": "SmartPaste: Learning to Adapt Source Code", "abstract": "

Deep Neural Networks have been shown to succeed at a range of natural\nlanguage tasks such as machine translation and text summarization.\nWhile tasks on source code (ie, formal languages) have been considered\nrecently, most work in this area does not attempt to capitalize on the\nunique opportunities offered by its known syntax and structure. In this\nwork, we introduce SmartPaste, a first task that requires to use such\ninformation. The task is a variant of the program repair problem that\nrequires to adapt a given (pasted) snippet of code to surrounding,\nexisting source code. As first solutions, we design a set of deep\nneural models that learn to represent the context of each variable\nlocation and variable usage in a data flow-sensitive way. Our\nevaluation suggests that our models can learn to solve the SmartPaste\ntask in many cases, achieving 58.6% accuracy, while learning meaningful\nrepresentation of variable usages.

\n", "tags": ["representation", "variable misuse"], "tsne_embedding": [-10.76509952545166, 2.330136299133301]}, {"key": "allamanis2018learning", "year": "2018", "title": "Learning to Represent Programs with Graphs", "abstract": "

Learning tasks on source code (i.e., formal languages) have been considered recently, but most work has tried to transfer natural language methods and does not capitalize on the unique opportunities offered by code\u2019s known syntax. For example, long-range dependencies induced by using the same variable or function in distant locations are often not considered. We propose to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures.

\n\n

In this work, we present how to construct graphs from source code and how to scale Gated Graph Neural Networks training to such large graphs. We evaluate our method on two tasks: VarNaming, in which a network attempts to predict the name of a variable given its usage, and VarMisuse, in which the network learns to reason about selecting the correct variable that should be used at a given program location. Our comparison to methods that use less structured program representations shows the advantages of modeling known structure, and suggests that our models learn to infer meaningful names and to solve the VarMisuse task in many cases. Additionally, our testing showed that VarMisuse identifies a number of bugs in mature open-source projects.

\n", "tags": ["naming", "GNN", "representation", "variable misuse", "defect"], "tsne_embedding": [-2.1892096996307373, 11.961808204650879]}, {"key": "allamanis2019adverse", "year": "2019", "title": "The Adverse Effects of Code Duplication in Machine Learning Models of Code", "abstract": "

The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has been recently identified by Lopes et al. (2017) who found a large amount of code duplication on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this article, we study the effect of code duplication to machine learning models showing that reported metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machine learning models of code are used by software engineers. We present an \u201cerrata\u201d for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them, and release tools to help the community avoid this problem in future research.

\n", "tags": ["dataset", "evaluation"], "tsne_embedding": [5.184110164642334, -8.973413467407227]}, {"key": "allamanis2020typilus", "year": "2020", "title": "Typilus: Neural Type Hints", "abstract": "

Type inference over partial contexts in dynamically typed languages is challenging. In this work, we present a graph neural network model that predicts types by probabilistically reasoning over a program\u2019s structure, names, and patterns. The network uses deep similarity learning to learn a TypeSpace \u2013 a continuous relaxation of the discrete space of types \u2013 and how to embed the type properties of a symbol (i.e. identifier) into it. Importantly, our model can employ one-shot learning to predict an open vocabulary of types, including rare and user-defined ones. We realise our approach in Typilus for Python that combines the TypeSpace with an optional type checker. We show that Typilus accurately predicts types. Typilus confidently predicts types for 70% of all annotatable symbols; when it predicts a type, that type optionally type checks 95% of the time. Typilus can also find incorrect type annotations; two important and popular open source libraries, fairseq and allennlp, accepted our pull requests that fixed the annotation errors Typilus discovered.

\n", "tags": ["types", "GNN"], "tsne_embedding": [-4.048420429229736, 28.966259002685547]}, {"key": "allamanis2021self", "year": "2021", "title": "Self-Supervised Bug Detection and Repair", "abstract": "

Machine learning-based program analyses have recently shown the promise of integrating formal and probabilistic reasoning towards aiding software development. However, in the absence of large annotated corpora, training these analyses is challenging. Towards addressing this, we present BugLab, an approach for self-supervised learning of bug detection and repair. BugLab co-trains two models: (1) a detector model that learns to detect and repair bugs in code, (2) a selector model that learns to create buggy code for the detector to use as training data. A Python implementation of BugLab improves by up to 30% upon baseline methods on a test dataset of 2374 real-life bugs and finds 19 previously unknown bugs in open-source software.

\n", "tags": ["GNN", "Transformer", "defect", "repair"], "tsne_embedding": [23.615211486816406, 2.4611806869506836]}, {"key": "alon2018code2seq", "year": "2019", "title": "code2seq: Generating Sequences from Structured Representations of Code", "abstract": "

The ability to generate natural language sequences from source code snippets has a variety of applications such as code summarization, documentation, and retrieval. Sequence-to-sequence (seq2seq) models, adopted from neural machine translation (NMT), have achieved state-of-the-art performance on these tasks by treating source code as a sequence of tokens. We present code2seq: an alternative approach that leverages the syntactic structure of programming languages to better encode source code. Our model represents a code snippet as the set of compositional paths in its abstract syntax tree (AST) and uses attention to select the relevant paths while decoding.

\n\n

We demonstrate the effectiveness of our approach for two tasks, two programming languages, and four datasets of up to 16M examples. Our model significantly outperforms previous models that were specifically designed for programming languages, as well as general state-of-the-art NMT models. An interactive online demo of our model is available at http://code2seq.org.

\n", "tags": ["naming", "summarization", "representation"], "tsne_embedding": [-14.505969047546387, -5.979738712310791]}, {"key": "alon2018general", "year": "2018", "title": "A General Path-Based Representation for Predicting Program Properties", "abstract": "

Predicting program properties such as names or expression types has a wide range of applications. It can ease the task of programming and increase programmer productivity. A major challenge when learning from programs is how to represent programs in a way that facilitates effective learning. \nWe present a general path-based representation for learning from programs. Our representation is purely syntactic and extracted automatically. The main idea is to represent a program using paths in its abstract syntax tree (AST). This allows a learning model to leverage the structured nature of code rather than treating it as a flat sequence of tokens. \nWe show that this representation is general and can: (i) cover different prediction tasks, (ii) drive different learning algorithms (for both generative and discriminative models), and (iii) work across different programming languages. \nWe evaluate our approach on the tasks of predicting variable names, method names, and full types. We use our representation to drive both CRF-based and word2vec-based learning, for programs of four languages: JavaScript, Java, Python and C#. Our evaluation shows that our approach obtains better results than task-specific handcrafted representations across different tasks and programming languages.

\n", "tags": ["naming", "representation"], "tsne_embedding": [3.2152352333068848, 9.585824966430664]}, {"key": "alon2019code2vec", "year": "2019", "title": "code2vec: Learning Distributed Representations of Code", "abstract": "

We present a neural model for representing snippets of code as continuous distributed vectors (\u201ccode embeddings\u201d).\n The main idea is to represent a code snippet as a single fixed-length\ncode vector, which can be used to\npredict semantic properties of the snippet. To this end, code is first decomposed to a collection of paths in its\nabstract syntax tree. Then, the network learns the atomic representation of each path while\nsimultaneously\nlearning how to aggregate a set of them.

\n\n

We demonstrate the effectiveness of our approach by using it to predict a method\u2019s name from the vector\nrepresentation of its body. We evaluate our approach by training a model on a dataset of 12M methods. We\nshow that code vectors trained on this dataset can predict method names from files that were unobserved\nduring training. Furthermore, we show that our model learns useful method name vectors that capture\nsemantic similarities, combinations, and analogies.

\n\n

A comparison of our approach to previous techniques over the same dataset shows an improvement of\nmore than 75%, making it the first to successfully predict method names based on a large, cross-project\ncorpus. Our trained model, visualizations and vector similarities are available as an interactive online demo at\nhttp://code2vec.org. The code, data and trained models are available at\nhttps://github.com/tech-srl/code2vec.

\n", "tags": ["naming", "summarization", "representation"], "tsne_embedding": [3.9596080780029297, -13.030256271362305]}, {"key": "alon2019structural", "year": "2019", "title": "Structural Language Models for Any-Code Generation", "abstract": "

We address the problem of Any-Code Generation (AnyGen) - generating code without any restriction on the vocabulary or structure. The state-of-the-art in this problem is the sequence-to-sequence (seq2seq) approach, which treats code as a sequence and does not leverage any structural information. We introduce a new approach to AnyGen that leverages the strict syntax of programming languages to model a code snippet as a tree - structural language modeling (SLM). SLM estimates the probability of the program\u2019s abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous structural techniques that have severely restricted the kinds of expressions that can be generated, our approach can generate arbitrary expressions in any programming language. Our model significantly outperforms both seq2seq and a variety of existing structured approaches in generating Java and C# code. We make our code, datasets, and models available online.

\n", "tags": ["code generation"], "tsne_embedding": [-19.922626495361328, -3.4203779697418213]}, {"key": "amodio2017neural", "year": "2017", "title": "Neural Attribute Machines for Program Generation", "abstract": "

Recurrent neural networks have achieved remarkable success at generating sequences with complex structures, thanks to advances that include richer embeddings of input and cures for vanishing gradients. Trained only on sequences from a known grammar, though, they can still struggle to learn rules and constraints of the grammar. Neural Attribute Machines (NAMs) are equipped with a logical machine that represents the underlying grammar, which is used to teach the constraints to the neural machine by (i) augmenting the input sequence, and (ii) optimizing a custom loss function. Unlike traditional RNNs, NAMs are exposed to the grammar, as well as samples from the language of the grammar. During generation, NAMs make significantly fewer violations of the constraints of the underlying grammar than RNNs trained only on samples from the language of the grammar.

\n\n", "tags": ["grammar", "code generation", "representation"], "tsne_embedding": [-23.80990219116211, 2.0070745944976807]}, {"key": "arakelyan2020towards", "year": "2020", "title": "Towards Learning Representations of Binary Executable Files for Security Tasks", "abstract": "

Tackling binary analysis problems has traditionally implied manually defining rules and heuristics. As an alternative, we are suggesting using machine learning models for learning distributed representations of binaries that can be applicable for a number of downstream tasks. We construct a computational graph from the binary executable and use it with a graph convolutional neural network to learn a high dimensional representation of the program. We show the versatility of this approach by using our representations to solve two semantically different binary analysis tasks \u2013 algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as published results and demonstrate improvement on the state of the art methods for both tasks.

\n", "tags": ["GNN", "representation"], "tsne_embedding": [6.587866306304932, 17.542301177978516]}, {"key": "ashwath2020predicting", "year": "2020", "title": "Predicting Vulnerability in Large Codebases With Deep Code Representation", "abstract": "

Currently, while software engineers write code for various modules, quite often, various types of errors - coding, logic, semantic, and others (most of which are not caught by compilation and other tools) get introduced. Some of these bugs might be found in the later stage of testing, and many times it is reported by customers on production code. Companies have to spend many resources, both money and time in finding and fixing the bugs which would have been avoided if coding was done right. Also, concealed flaws in software can lead to security vulnerabilities that potentially allow attackers to compromise systems and applications. Interestingly, same or similar issues/bugs, which were fixed in the past (although in different modules), tend to get introduced in production code again.\nWe developed a novel AI-based system which uses the deep representation of Abstract Syntax Tree (AST) created from the source code and also the active feedback loop to identify and alert the potential bugs that could be caused at the time of development itself i.e. as the developer is writing new code (logic and/or function). This tool integrated with IDE as a plugin would work in the background, point out existing similar functions/code-segments and any associated bugs in those functions. The tool would enable the developer to incorporate suggestions right at the time of development, rather than waiting for UT/QA/customer to raise a defect.\nWe assessed our tool on both open-source code and also on Cisco codebase for C and C++ programing language. Our results confirm that deep representation of source code and the active feedback loop is an assuring approach for predicting security and other vulnerabilities present in the code.

\n", "tags": ["grammar", "program analysis", "static analysis"], "tsne_embedding": [13.79163646697998, 10.763879776000977]}, {"key": "aye2020learning", "year": "2020", "title": "Learning Autocompletion from Real-World Datasets", "abstract": "

Code completion is a popular software development tool integrated into all major IDEs. Many neural language models have achieved promising results in completion suggestion prediction on synthetic benchmarks. However, a recent study When Code Completion Fails: a Case Study on Real-World Completions demonstrates that these results may not translate to improvements in real-world performance. To combat this effect, we train models on real-world code completion examples and find that these models outperform models trained on committed source code and working version snapshots by 12.8% and 13.8% accuracy respectively. We observe this improvement across modeling technologies and show through A/B testing that it corresponds to a 6.2% increase in programmers\u2019 actual autocompletion usage. Furthermore, our study characterizes a large corpus of logged autocompletion usages to investigate why training on real-world examples leads to stronger models.

\n", "tags": ["autocomplete"], "tsne_embedding": [-7.598304748535156, 7.473986625671387]}, {"key": "aye2020sequence", "year": "2020", "title": "Sequence Model Design for Code Completion in the Modern IDE", "abstract": "

Code completion plays a prominent role in modern integrated development environments (IDEs). Machine learning has become ubiquitous in analogous natural language writing and search software, surfacing more relevant autocompletions and search suggestions in fewer keystrokes. Prior research has reported training high-accuracy, deep neural networks for modeling source code, but little attention has been given to the practical constraints imposed by interactive developer tools. In particular, neural language models for source code modeling like the one described in Maybe Deep Neural Networks are the Best Choice for Modeling Source Code are framed around code completion, but only report accuracy of next-token prediction. However, in order for a language model (LM) to work well within real-world code completion systems, it must also always make suggestions that produce valid code that typechecks to support code completion\u2019s role in correctness-checking; return instantaneous results to help programmers code more efficiently in fewer keystrokes; and be small enough to fit comfortably on disk and in memory on developer workstations, since virtually all modern IDEs run locally and support offline usage. To meet these additional requirements, we propose a novel design for predicting top-k next tokens that combines static analysis\u2019 ability to enumerate all valid keywords and in-scope identifiers with the ability of a language model to place a probability distribution over them. Our model mixes character-level input representation with token output to represent out-of-vocabulary (OOV) tokens meaningfully and minimize prediction latency. OOV tokens can be predicted through detection of local repetition common in software. This design achieves state-of-art accuracy in source code modeling and fits the constraints imposed by real-world code completion implementations in modern IDEs.

\n", "tags": ["autocomplete"], "tsne_embedding": [-5.976065635681152, 4.511958122253418]}, {"key": "bai2021jointly", "year": "2021", "title": "Jointly Learning to Repair Code and Generate Commit Message", "abstract": "

We propose a novel task of jointly repairing program codes and generating commit messages. Code repair and commit message generation are two essential and related tasks for software development. However, existing work usually performs the two tasks independently. We construct a multilingual triple dataset including buggy code, fixed code, and commit messages for this novel task. We provide the cascaded models as baseline, which are enhanced with different training approaches, including the teacher-student method, the multi-task method, and the back-translation method. To deal with the error propagation problem of the cascaded method, the joint model is proposed that can both repair the code and generate the commit message in a unified framework. Experimental results show that the enhanced cascaded model with teacher-student method and multitask-learning method achieves the best score on different metrics of automated code repair, and the joint model behaves better than the cascaded model on commit message generation.

\n", "tags": ["edit", "Transformer"], "tsne_embedding": [19.9661922454834, -1.4955811500549316]}, {"key": "barchi2019code", "year": "2019", "title": "Code Mapping in Heterogeneous Platforms Using Deep Learning and LLVM-IR", "abstract": "

Modern heterogeneous platforms require compilers capable of choosing the appropriate device for the execution of program portions. This paper presents a machine learning method designed for supporting mapping decisions through the analysis of the program source code represented in LLVM assembly language (IR) for exploiting the advantages offered by this generalised and optimised representation. To evaluate our solution, we trained an LSTM neural network on OpenCL kernels compiled in LLVM-IR and processed with our tokenizer capable of filtering less-informative tokens. We tested the network that reaches an accuracy of 85% in distinguishing the best computational unit.

\n", "tags": ["optimization", "program analysis", "static analysis", "natural language processing"], "tsne_embedding": [3.2512338161468506, 17.502975463867188]}, {"key": "barchi2021exploration", "year": "2021", "title": "Exploration of Convolutional Neural Network models for source code classification", "abstract": "

The application of Artificial Intelligence is becoming common in many engineering fields. Among them, one of the newest and rapidly evolving is software generation, where AI can be used to automatically optimise the implementation of an algorithm for a given computing platform. In particular, Deep Learning technologies can be used to the decide how to allocate pieces of code to hardware platforms with multiple cores and accelerators, that are common in high performance and edge computing applications. In this work, we explore the use of Convolutional Neural Networks (CNN)s to analyse the application source code and decide the best compute unit to minimise the execution time. We demonstrate that CNN models can be successfully applied to source code classification, providing higher accuracy with consistently reduced learning time with respect to state-of-the-art methods. Moreover, we show the robustness of the method with respect to source code pre-processing, compiler options and hyper-parameters selection.

\n", "tags": ["optimization", "static analysis", "program analysis", "language model"], "tsne_embedding": [1.3105802536010742, 17.936918258666992]}, {"key": "barchi2022deep", "year": "2022", "title": "Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities", "abstract": "

To cope with the increasing complexity of digital systems programming, deep learning techniques have recently been proposed to enhance software deployment by analysing source code for different purposes, ranging from performance and energy improvement to debugging and security assessment. As embedded platforms for cyber-physical systems are characterised by increasing heterogeneity and parallelism, one of the most challenging and specific problems is efficiently allocating computational kernels to available hardware resources. In this field, deep learning applied to source code can be a key enabler to face this complexity. However, due to the rapid development of such techniques, it is not easy to understand which of those are suitable and most promising for this class of systems. For this purpose, we discuss recent developments in deep learning for source code analysis, and focus on techniques for kernel mapping on heterogeneous platforms, highlighting recent results, challenges and opportunities for their applications to cyber-physical systems.

\n", "tags": ["optimization", "review"], "tsne_embedding": [2.6983368396759033, 18.476036071777344]}, {"key": "bareiss2022code", "year": "2022", "title": "Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code", "abstract": "

Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises the question whether they could serve as a basis for building a wide range code generation tools. Traditionally, such tools are built manually and separately for each task. Instead, few-shot learning may allow to obtain different tools from a single pre-trained language model by simply providing a few examples or a natural language description of the expected tool behavior. This paper studies to what extent a state-of-the-art, pre-trained language model of code, Codex, may serve this purpose. We consider three code manipulation and code generation tasks targeted by a range of traditional tools: (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation. For each task, we compare few-shot learning to a manually built tool. Our results show that the model-based tools complement (code mutation), are on par (test oracle generation), or even outperform their respective traditionally built tool (test case generation), while imposing far less effort to develop them. By comparing the effectiveness of different variants of the model-based tools, we provide insights on how to design an appropriate input (\u201cprompt\u201d) to the model and what influence the size of the model has. For example, we find that providing a small natural language description of the code generation task is an easy way to improve predictions. Overall, we conclude that few-shot language models are surprisingly effective, yet there is still more work to be done, such as exploring more diverse ways of prompting and tackling even more involved tasks.

\n", "tags": ["Transformer"], "tsne_embedding": [4.2617974281311035, 1.0564607381820679]}, {"key": "barke2022grounded", "year": "2022", "title": "Grounded Copilot: How Programmers Interact with Code-Generating Models", "abstract": "

Powered by recent advances in code-generating models, AI assistants like Github Copilot promise to change the face of programming forever. But what is this new face of programming? We present the first grounded theory analysis of how programmers interact with Copilot, based on observing 20 participants\u2013with a range of prior experience using the assistant\u2013as they solve diverse programming tasks across four languages. Our main finding is that interactions with programming assistants are bimodal: in acceleration mode, the programmer knows what to do next and uses Copilot to get there faster; in exploration mode, the programmer is unsure how to proceed and uses Copilot to explore their options. Based on our theory, we provide recommendations for improving the usability of future AI programming assistants.

\n", "tags": ["human evaluation", "synthesis"], "tsne_embedding": [7.630485534667969, -2.4966397285461426]}, {"key": "barone2017parallel", "year": "2017", "title": "A parallel corpus of Python functions and documentation strings for automated code documentation and code generation", "abstract": "

Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains.

\n\n

In this work we introduce a large and diverse parallel corpus of a hundred thousands Python functions with their documentation strings (\u201cdocstrings\u201d) generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with \ndata augmentation techniques to further increase the amount of training data.

\n\n

We release our datasets and processing scripts in order to stimulate research in these areas.

\n\n", "tags": ["documentation", "summarization", "dataset"], "tsne_embedding": [-7.601117134094238, -10.433578491210938]}, {"key": "bavarian2022efficient", "year": "2022", "title": "Efficient Training of Language Models to Fill in the Middle", "abstract": "

We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill-in-the-middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices to train FIM models. We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.

\n", "tags": ["Transformer", "language model"], "tsne_embedding": [-10.514466285705566, -2.352198600769043]}, {"key": "bavishi2017context2name", "year": "2017", "title": "Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts", "abstract": "

Most of the JavaScript code deployed in the wild has been minified, a process in which identifier names are replaced\nwith short, arbitrary and meaningless names. Minified code occupies less space, but also makes the code extremely difficult to manually inspect and understand. This paper presents Context2Name, a deep learning-based technique that partially reverses the effect of minification by predicting natural\nidentifier names for minified names. The core idea is to predict from the usage context of a variable a name that captures\nthe meaning of the variable. The approach combines a lightweight, token-based static analysis with an auto-encoder\nneural network that summarizes usage contexts and a recurrent neural network that predict natural names for a given\nusage context. We evaluate Context2Name\nwith a large corpus of real-world JavaScript code and show that it successfully predicts 60.4% of all minified identifiers. A comparison\nwith the state-of-the-art tools JSNice and JSNaughty shows\nthat our approach predicts 17% and 43% more names than the\nbest existing approaches, while taking only 2.6 milliseconds\nto predict a name, on average.

\n", "tags": ["naming"], "tsne_embedding": [17.89051628112793, 20.510589599609375]}, {"key": "bavishi2019autopandas", "year": "2019", "title": "AutoPandas: neural-backed generators for program synthesis", "abstract": "

Developers nowadays have to contend with a growing number of APIs. While in the long-term they are very useful to developers, many modern APIs have an incredibly steep learning curve, due to their hundreds of functions handling many arguments, obscure documentation, and frequently changing semantics. For APIs that perform data transformations, novices can often provide an I/O example demonstrating the desired transformation, but may be stuck on how to translate it to the API. A programming-by-example synthesis engine that takes such I/O examples and directly produces programs in the target API could help such novices. Such an engine presents unique challenges due to the breadth of real-world APIs, and the often-complex constraints over function arguments. We present a generator-based synthesis approach to contend with these problems. This approach uses a program candidate generator, which encodes basic constraints on the space of programs. We introduce neural-backed operators which can be seamlessly integrated into the program generator. To improve the efficiency of the search, we simply use these operators at non-deterministic decision points, instead of relying on domain-specific heuristics. We implement this technique for the Python pandas library in AutoPandas. AutoPandas supports 119 pandas dataframe transformation functions. We evaluate AutoPandas on 26 real-world benchmarks and find it solves 17 of them.

\n", "tags": ["synthesis", "GNN", "API"], "tsne_embedding": [5.271886825561523, 5.551120758056641]}, {"key": "beltramelli2017pix2code", "year": "2017", "title": "pix2code: Generating Code from a Graphical User Interface Screenshot", "abstract": "

Transforming a graphical user interface screenshot created by a designer into computer code is a typical task conducted by a developer in order to build customized software, websites and mobile applications. In this paper, we show that Deep Learning techniques can be leveraged to automatically generate code given a graphical user interface screenshot as input. Our model is able to generate code targeting three different platforms (i.e. iOS, Android and web-based technologies) from a single input image with over 77% of accuracy.

\n\n", "tags": ["code generation", "bimodal"], "tsne_embedding": [-1.403306007385254, 19.172359466552734]}, {"key": "bennun2018neural", "year": "2018", "title": "Neural Code Comprehension: A Learnable Representation of Code Semantics", "abstract": "

With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that with a single RNN architecture and pre-trained fixed embeddings, inst2vec outperforms specialized approaches for performance prediction (compute device mapping, optimal thread coarsening); and algorithm classification from raw code (104 classes), where we set a new state-of-the-art.

\n", "tags": ["representation"], "tsne_embedding": [2.0167646408081055, 11.148731231689453]}, {"key": "berabi2021tfix", "year": "2021", "title": "TFix: Learning to Fix Coding Errors with a Text-to-Text Transformer", "abstract": "

The problem of fixing errors in programs has attracted substantial interest over the years. The\nkey challenge for building an effective code fixing tool is to capture a wide range of errors and\nmeanwhile maintain high accuracy. In this paper, we address this challenge and present a new\nlearning-based system, called TFix. TFix works\ndirectly on program text and phrases the problem of code fixing as a text-to-text task. In turn,\nthis enables it to leverage a powerful Transformer\nbased model pre-trained on natural language and\nfine-tuned to generate code fixes (via a large, high-quality dataset obtained from GitHub commits).\nTFix is not specific to a particular programming\nlanguage or class of defects and, in fact, improved\nits precision by simultaneously fine-tuning on 52\ndifferent error types reported by a popular static\nanalyzer. Our evaluation on a massive dataset of\nJavaScript programs shows that TFix is practically\neffective: it is able to synthesize code that fixes\nthe error in \u223c67 percent of cases and significantly\noutperforms existing learning-based approaches.

\n", "tags": ["repair"], "tsne_embedding": [17.9658260345459, -1.1876564025878906]}, {"key": "berabi2024deepcode", "year": "2024", "title": "DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models", "abstract": "

The automated program repair field has attracted substantial interest over the years, but despite significant research efforts, creating a system that works well for complex semantic bugs such as security vulnerabilities has proven difficult. A promising direction to solve this challenge is by leveraging large language models (LLMs), which are increasingly used to solve various programming tasks. In this paper, we investigate the effectiveness of LLMs for solving code-repair task. We show that the task is difficult as it requires the model to learn long-range code relationships, a task that inherently relies on extensive amounts of training data. At the same time, creating a large, clean dataset for complex program bugs and their corresponding fixes is non-trivial. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs. The idea is to use program analysis to limit the LLM\u2019s attention mechanism on the portions of code needed to perform the fix, drastically reducing the amount of required training data. Concretely, for training and inference, rather than feeding the entire program to the LLM, we reduce its code to a much shorter snippet that contains the reported defect together with the necessary context - and use that instead. Our evaluation shows that this code reduction approach substantially improves available models such as GPT-4 using few-shot learning, as well as fine-tuning models. To train and evaluate our system, we created a comprehensive code fixing dataset by extensively labeling 156 bug patterns (including 40 security rules), requiring complex interprocedural dataflow to discover. Our best system with Mixtral-8x7B can remove more than 80% of the reported defects while exactly matching the human fix in between 10 and 50% of cases, outperforming baselines based on GPT-3.5 and GPT-4, or based on window-based models like TFix.

\n", "tags": ["repair", "vulnerability"], "tsne_embedding": [17.92475700378418, 2.5033960342407227]}, {"key": "bhatia2016automated", "year": "2016", "title": "Automated Correction for Syntax Errors in Programming Assignments using Recurrent Neural Networks", "abstract": "

We present a method for automatically generating repair feedback for syntax errors for introductory programming problems. Syntax errors constitute one of the largest classes of errors (34%) in our dataset of student submissions obtained from a MOOC course on edX. The previous techniques for generating automated feedback on programming assignments have focused on functional correctness and style considerations of student programs. These techniques analyze the program AST of the program and then perform some dynamic and symbolic analyses to compute repair feedback. Unfortunately, it is not possible to generate ASTs for student programs with syntax errors and therefore the previous feedback techniques are not applicable in repairing syntax errors. We present a technique for providing feedback on syntax errors that uses Recurrent neural networks (RNNs) to model syntactically valid token sequences. Our approach is inspired from the recent work on learning language models from Big Code (large code corpus). For a given programming assignment, we first learn an RNN to model all valid token sequences using the set of syntactically correct student submissions. Then, for a student submission with\nsyntax errors, we query the learnt RNN model with the prefix token sequence to predict token sequences that can fix the error by either replacing or inserting the predicted token sequence at the error location. We evaluate our technique on over 14, 000 student submissions with syntax errors. Our technique can completely repair 31.69% (4501/14203) of submissions with syntax errors and in addition partially correct 6.39% (908/14203) of the submissions.

\n", "tags": ["repair"], "tsne_embedding": [19.70320701599121, -4.037219524383545]}, {"key": "bhatia2018neurosymbolic", "year": "2018", "title": "Neuro-symbolic program corrector for introductory programming assignments", "abstract": "

Automatic correction of programs is a challenging problem with numerous real world applications in security, verification, and education. One application that is becoming increasingly important is the correction of student submissions in online courses for providing feedback. Most existing program repair techniques analyze Abstract Syntax Trees (ASTs) of programs, which are unfortunately unavailable for programs with syntax errors. In this paper, we propose a novel Neuro-symbolic approach that combines neural networks with constraint-based reasoning. Specifically, our method first uses a Recurrent Neural Network (RNN) to perform syntax repairs for the buggy programs; subsequently, the resulting syntactically-fixed programs are repaired using constraint-based techniques to ensure functional correctness. The RNNs are trained using a corpus of syntactically correct submissions for a given programming assignment, and are then queried to fix syntax errors in an incorrect programming submission by replacing or inserting the predicted tokens at the error location. We evaluate our technique on a dataset comprising of over 14,500 student submissions with syntax errors. Our method is able to repair syntax errors in 60% (8689) of submissions, and finds functionally correct repairs for 23.8% (3455) submissions.

\n", "tags": ["repair"], "tsne_embedding": [19.91767120361328, -3.9428634643554688]}, {"key": "bhoopchand2016learning", "year": "2016", "title": "Learning Python Code Suggestion with a Sparse Pointer Network", "abstract": "

To enhance developer productivity, all modern integrated development environments (IDEs) include code suggestion functionality that proposes likely next tokens at the cursor. While current IDEs work well for statically-typed languages, their reliance on type annotations means that they do not provide the same level of support for dynamic programming languages as for statically-typed languages. Moreover, suggestion engines in modern IDEs do not propose expressions or multi-statement idiomatic code. Recent work has shown that language models can improve code suggestion systems by learning from software repositories. This paper introduces a neural language model with a sparse pointer network aimed at capturing very long-range dependencies. We release a large-scale code suggestion corpus of 41M lines of Python code crawled from GitHub. On this corpus, we found standard neural language models to perform well at suggesting local phenomena, but struggle to refer to identifiers that are introduced many tokens in the past. By augmenting a neural language model with a pointer network specialized in referring to predefined classes of identifiers, we obtain a much lower perplexity and a 5 percentage points increase in accuracy for code suggestion compared to an LSTM baseline. In fact, this increase in code suggestion accuracy is due to a 13 times more accurate prediction of identifiers. Furthermore, a qualitative analysis shows this model indeed captures interesting long-range dependencies, like referring to a class member defined over 60 tokens in the past.

\n", "tags": ["language model", "autocomplete"], "tsne_embedding": [-5.998359203338623, 3.7278692722320557]}, {"key": "bian2020sinkfinder", "year": "2020", "title": "SinkFinder: harvesting hundreds of unknown interesting function pairs with just one seed", "abstract": "

Mastering the knowledge about security-sensitive functions that can potentially result in bugs is valuable to detect them. However, identifying this kind of functions is not a trivial task. Introducing machine learning-based techniques to do the task is a natural choice. Unfortunately, the approach also requires considerable prior knowledge, e.g., sufficient labelled training samples. In practice, the requirement is often hard to meet.

\n\n

In this paper, to solve the problem, we propose a novel and practical method called SinkFinder to automatically discover function pairs that we are interested in, which only requires very limited prior knowledge. SinkFinder first takes just one pair of well-known interesting functions as the initial seed to infer enough positive and negative training samples by means of sub-word word embedding. By using these samples, a support vector machine classifier is trained to identify more interesting function pairs. Finally, checkers equipped with the obtained knowledge can be easily developed to detect bugs in target systems. The experiments demonstrate that SinkFinder can successfully discover hundreds of interesting functions and detect dozens of previously unknown bugs from large-scale systems, such as Linux, OpenSSL and PostgreSQL.

\n", "tags": ["program analysis"], "tsne_embedding": [16.870281219482422, 7.561191558837891]}, {"key": "bibaev2022all", "year": "2022", "title": "All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs", "abstract": "

We propose an approach for collecting completion usage logs from the users in an IDE and using them to train a machine learning based model for ranking completion candidates.\nWe developed a set of features that describe completion candidates and their context, and deployed their anonymized collection in the Early Access Program of IntelliJ-based IDEs.\nWe used the logs to collect a dataset of code completions from users, and employed it to train a ranking CatBoost model.\nThen, we evaluated it in two settings: on a held-out set of the collected completions and in a separate A/B test on two different groups of users in the IDE.\nOur evaluation shows that using a simple ranking model trained on the past user behavior logs significantly improved code completion experience.\nCompared to the default heuristics-based ranking, our model demonstrated a decrease in the number of typing actions necessary to perform the completion in the IDE from 2.073 to 1.832.\nThe approach adheres to privacy requirements and legal constraints, since it does not require collecting personal information, performing all the necessary anonymization on the client\u2019s side.\nImportantly, it can be improved continuously: implementing new features, collecting new data, and evaluating new models - this way, we have been using it in production since the end of 2020.

\n", "tags": ["autocomplete"], "tsne_embedding": [-9.353668212890625, -14.368273735046387]}, {"key": "bichsel2016statistical", "year": "2016", "title": "Statistical Deobfuscation of Android Applications", "abstract": "

This work presents a new approach for deobfuscating Android APKs based on probabilistic learning of large code bases (termed \u201cBig Code\u201d). The key idea is to learn a probabilistic model over thousands of non-obfuscated Android applications and to use this probabilistic model to deobfuscate new, unseen Android APKs. The concrete focus of the paper is on reversing layout obfuscation, a popular transformation which renames key program elements such as classes, packages, and methods, thus making it difficult to understand what the program does. Concretely, the paper: (i) phrases the layout deobfuscation problem of Android APKs as structured prediction in a probabilistic graphical model, (ii) instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy, and (iii) shows how to leverage powerful inference and learning algorithms to achieve overall precision and scalability of the probabilistic predictions.

\n\n

We implemented our approach in a tool called DeGuard and used it to: (i) reverse the layout obfuscation performed by the popular ProGuard system on benign, open-source applications, (ii) predict third-party libraries imported by benign APKs (also obfuscated by ProGuard), and (iii) rename obfuscated program elements of Android malware. The experimental results indicate that DeGuard is practically effective: it recovers 79.1% of the program element names obfuscated with ProGuard, it predicts third-party libraries with accuracy of 91.3%, and it reveals string decoders and classes that handle sensitive data in Android malware.

\n\n", "tags": ["deobfuscation", "naming"], "tsne_embedding": [23.715848922729492, 14.545989036560059]}, {"key": "bieber2020learning", "year": "2020", "title": "Learning to Execute Programs with Instruction Pointer Attention Graph Neural Networks", "abstract": "

Graph neural networks (GNNs) have emerged as a powerful tool for learning software engineering tasks including code completion, bug finding, and program repair. They benefit from leveraging program structure like control flow graphs, but they are not well-suited to tasks like program execution that require far more sequential reasoning steps than number of GNN propagation steps. Recurrent neural networks (RNNs), on the other hand, are well-suited to long sequential chains of reasoning, but they do not naturally incorporate program structure and generally perform worse on the above tasks. Our aim is to achieve the best of both worlds, and we do so by introducing a novel GNN architecture, the Instruction Pointer Attention Graph Neural Networks (IPA-GNN), which achieves improved systematic generalization on the task of learning to execute programs using control flow graphs. The model arises by considering RNNs operating on program traces with branch decisions as latent variables. The IPA-GNN can be seen either as a continuous relaxation of the RNN model or as a GNN variant more tailored to execution. To test the models, we propose evaluating systematic generalization on learning to execute using control flow graphs, which tests sequential reasoning and use of program structure. More practically, we evaluate these models on the task of learning to execute partial programs, as might arise if using the model as a heuristic function in program synthesis. Results show that the IPA-GNN outperforms a variety of RNN and GNN baselines on both tasks.

\n", "tags": ["representation", "dynamic"], "tsne_embedding": [0.6948585510253906, 13.564172744750977]}, {"key": "bieber2022static", "year": "2022", "title": "Static Prediction of Runtime Errors by Learning to Execute Programs with External Resource Descriptions", "abstract": "

The execution behavior of a program often depends on external resources, such as program inputs or file contents, and so cannot be run in isolation. Nevertheless, software developers benefit from fast iteration loops where automated tools identify errors as early as possible, even before programs can be compiled and run. This presents an interesting machine learning challenge: can we predict runtime errors in a \u201cstatic\u201d setting, where program execution is not possible? Here, we introduce a real-world dataset and task for predicting runtime errors, which we show is difficult for generic models like Transformers. We approach this task by developing an interpreter-inspired architecture with an inductive bias towards mimicking program executions, which models exception handling and \u201clearns to execute\u201d descriptions of the contents of external resources. Surprisingly, we show that the model can also predict the location of the error, despite being trained only on labels indicating the presence/absence and kind of error. In total, we present a practical and difficult-yet-approachable challenge problem related to learning program execution and we demonstrate promising new capabilities of interpreter-inspired machine learning models for code.

\n", "tags": ["dataset", "defect"], "tsne_embedding": [9.493935585021973, 9.38499927520752]}, {"key": "bielik2016phog", "year": "2016", "title": "PHOG: Probabilistic Model for Code", "abstract": "

We introduce a new generative model for code called probabilistic higher order grammar (PHOG). PHOG generalizes probabilistic context free grammars (PCFGs) by allowing conditioning of a production rule beyond the parent non-terminal, thus capturing rich contexts relevant to programs. Even though PHOG is more powerful than a PCFG, it can be learned from data just as efficiently. We trained a PHOG model on a large JavaScript code corpus and show that it is more precise than existing models, while similarly fast. As a result, PHOG can immediately benefit existing programming tools based on probabilistic models of code.

\n", "tags": ["grammar", "code generation", "language model"], "tsne_embedding": [-19.936084747314453, -2.704740285873413]}, {"key": "bielik2020adversarial", "year": "2020", "title": "Adversarial Robustness for Code", "abstract": "

We propose a novel technique which addresses the challenge of learning accurate and robust models of code in a principled way. Our method consists of three key components: (i) learning to abstain from making a prediction if uncertain, (ii) adversarial training, and (iii) representation refinement which learns the program parts relevant for the prediction and abstracts the rest. These components are used to iteratively train multiple models, each of which learns a suitable program representation necessary to make robust predictions on a different subset of the dataset. We instantiated our approach to the task of type inference for dynamically typed languages and demonstrate its effectiveness by learning a model that achieves 88% accuracy and 84% robustness. Further, our evaluation shows that using the combination of all three components is key to obtaining accurate and robust models.

\n", "tags": ["adversarial", "types"], "tsne_embedding": [-0.7922861576080322, 27.389371871948242]}, {"key": "bouzenia2023tracefixer", "year": "2023", "title": "TraceFixer: Execution Trace-Driven Program Repair", "abstract": "

When debugging unintended program behavior, developers can often identify the point in the execution where the actual behavior diverges from the desired behavior. For example, a variable may get assigned a wrong value, which then negatively influences the remaining computation. Once a developer identifies such a divergence, how to fix the code so that it provides the desired behavior? This paper presents TraceFixer, a technique for predicting how to edit source code so that it does not diverge from the expected behavior anymore. The key idea is to train a neural program repair model that not only learns from source code edits but also exploits excerpts of runtime traces. The input to the model is a partial execution trace of the incorrect code, which can be obtained automatically through code instrumentation, and the correct state that the program should reach at the divergence point, which the user provides, e.g., in an interactive debugger. Our approach fundamentally differs from current program repair techniques, which share a similar goal but exploit neither execution traces nor information about the desired program state. We evaluate TraceFixer on single-line mistakes in Python code. After training the model on hundreds of thousands of code edits created by a neural model that mimics real-world bugs, we find that exploiting execution traces improves the bug-fixing ability by 13% to 20% (depending on the dataset, within the top-10 predictions) compared to a baseline that learns from source code edits only. Applying TraceFixer to 20 real-world Python bugs shows that the approach successfully fixes 10 of them.

\n", "tags": ["Transformer", "repair", "dynamic"], "tsne_embedding": [22.073287963867188, 5.416248321533203]}, {"key": "bouzenia2024repairagent", "year": "2024", "title": "RepairAgent: An Autonomous, LLM-Based Agent for Program Repair", "abstract": "

Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent\u2019s effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI\u2019s GPT-3.5 model, translates to 14 cents of USD per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.

\n", "tags": ["repair"], "tsne_embedding": [21.792926788330078, 2.1868948936462402]}, {"key": "brach2024can", "year": "2024", "title": "Can Large Language Model Detect Plagiarism in Source Code?", "abstract": "

The issue of code plagiarism represents a significant challenge in the academic environment. This study examines the potential of large language models (LLMs) in improving the detection of code plagiarism. The performance of several LLMs, including GPT-4o, GPT3.5 Turbo, LLaMA 3, and CodeLlama, is evaluated in comparison to conventional tools, such as JPlag, across a range of levels of code plagiarism. The findings of our study illustrate that state-of-the-art LLMs are able to outperform traditional methods, particularly in the detection of sophisticated forms of plagiarism. GPT-4o exhibited the highest overall accuracy (78.70%) and an F1 score of 86.97%. It is important to note that open-source models, such as LLaMA 3 (accuracy 71.53%, F1 score 82.75%), demonstrated the ability to detect the most complex forms of plagiarism with the same accuracy as GPT-4o. While these results demonstrate the promising potential of LLMs in code similarity analysis, it is also evident that higher false positive rates may be an inherent limitation, emphasizing the need for human oversight. This study contributes valuable insights into the application of AI in maintaining code integrity and academic honesty, paving the way for more effective, interpretable, and fair plagiarism detection systems in software development education and practice.

\n", "tags": ["code similarity", "large language models", "LLM", "plagiarism detection", "natural language processing"], "tsne_embedding": [8.326309204101562, -9.885233879089355]}, {"key": "brauckmann2020compiler", "year": "2020", "title": "Compiler-based graph representations for deep learning models of code", "abstract": "

In natural language processing, novel methods in deep learning, like recurrent neural networks (RNNs) on sequences of words, have been very successful. These methods have also been used recently for tasks in compiler optimization, like heterogeneous mapping of OpenCL kernels or predicting thread coarsening factors for optimal execution times. In contrast to natural languages, programming languages usually have a well-defined structure. This structure is what enables compilers to reason about programs on the foundations of graphs, such as abstract syntax trees (ASTs) or control-data flow graphs (CDFGs).\nIn this paper, we argue that we should use these graph structures instead of word sequences for learning compiler optimization tasks. To this end we apply recently proposed graph neural networks (GNNs) for learning predictive compiler tasks on two representations based on ASTs and CDFGs. Experimental results show how these representations improve upon the accuracy of the state-of-the-art in the task of heterogeneous OpenCL mapping, while providing orders of magnitude faster inference times, which are crucial for compiler optimizations. When testing on benchmark suites not included for training, our graph-based methods significantly outperform the state-of-the art by 12 percentage points in terms of accuracy, and are the only ones to perform better than a random mapping. When testing on the task of predicting thread coarsening factors, we expose current limitations of deep learning in compilers. We show how all of the deep learning approaches proposed so far, including our graph-based models, fail to produce an overall speedup with their predictions.

\n", "tags": ["representation", "compilation", "optimization", "GNN"], "tsne_embedding": [0.7371352314949036, 14.73369026184082]}, {"key": "brauckmann2020compy", "year": "2020", "title": "ComPy-Learn: A toolbox for exploring machine learning representations for compilers", "abstract": "

Deep Learning methods have not only shown to improve software performance in compiler heuristics, but also e.g. to improve security in vulnerability prediction or to boost developer productivity in software engineering tools. A key to the success of such methods across these use cases is the expressiveness of the representation used to abstract from the program code. Recent work has shown that different such representations have unique advantages in terms of performance. However, determining the best-performing one for a given task is often not obvious and requires empirical evaluation.\nTherefore, we present ComPy-Learn, a toolbox for conveniently defining, extracting, and exploring representations of program code. With syntax-level language information from the Clang compiler frontend and low-level information from the LLVM compiler backend, the tool supports the construction of linear and graph representations and enables an efficient search for the best-performing representation and model for tasks on program code.

\n", "tags": ["representation", "compilation", "optimization", "GNN"], "tsne_embedding": [4.786099910736084, 15.805383682250977]}, {"key": "briem2020offside", "year": "2020", "title": "OffSide: Learning to Identify Mistakes in Boundary Conditions", "abstract": "

Mistakes in boundary conditions are the cause of many bugs in software.\nThese mistakes happen when, e.g., developers make use of < or > in cases\nwhere they should have used <= or >=. Mistakes in boundary conditions\nare often hard to find and manually detecting them might be very time-consuming\nfor developers. While researchers have been proposing techniques to cope with\nmistakes in the boundaries for a long time, the automated detection of such bugs still\nremains a challenge. We conjecture that, for a tool to be able to precisely identify mistakes\nin boundary conditions, it should be able to capture the overall context of the source code\nunder analysis. In this work, we propose a deep learning model that learn mistakes in boundary\nconditions and, later, is able to identifythem in unseen code snippets. We train and test a\nmodel on over 1.5 million code snippets, with and without mistakes in different boundary conditions.\nOur model shows an accuracy from 55% up to 87%. The model is also able to detect 24 out of 41\nreal-world bugs;however, with a high false positive rate. The existing state-of-the-practice linter\ntools are not able to detect any of the bugs. We hope this paper can pave the road towards deep\nlearning models that will be able to support developers in detecting mistakes in boundary conditions.

\n", "tags": ["defect"], "tsne_embedding": [16.91876220703125, 3.2097203731536865]}, {"key": "brockschmidt2019generative", "year": "2019", "title": "Generative Code Modeling with Graphs", "abstract": "

Generative models forsource code are an interesting structured prediction problem, requiring to reason about both hard syntactic and semantic constraints as well as about natural, likely programs. We present a novel model for this problem that uses a graph to represent the intermediate state of the generated output. Our model generates code by interleaving grammar-driven expansion steps with graph augmentation and neural message passing steps. An experimental evaluation shows that our new model can generate semantically meaningful expressions, outperforming a range of strong baselines.

\n", "tags": ["grammar", "code generation", "GNN"], "tsne_embedding": [-21.4447021484375, -2.7081401348114014]}, {"key": "brody2020structural", "year": "2020", "title": "A Structural Model for Contextual Code Changes", "abstract": "

We address the problem of predicting edit completions based on a learned model that was trained on past edits. Given a code snippet that is partially edited, our goal is to predict a completion of the edit for the rest of the snippet. We refer to this task as the EditCompletion task and present a novel approach for tackling it. The main idea is to directly represent structural edits. This allows us to model the likelihood of the edit itself, rather than learning the likelihood of the edited code. We represent an edit operation as a path in the program\u2019s Abstract Syntax Tree (AST), originating from the source of the edit to the target of the edit. Using this representation, we present a powerful and lightweight neural model for the EditCompletion task. We conduct a thorough evaluation, comparing our approach to a variety of representation and modeling approaches that are driven by multiple strong models such as LSTMs, Transformers, and neural CRFs. Our experiments show that our model achieves 28% relative gain over state-of-the-art sequential models and 2\u00d7 higher accuracy than syntactic models that learn to generate the edited code instead of modeling the edits directly. Our code, dataset, and trained models are publicly available at https://github.com/tech-srl/c3po/ .

\n", "tags": ["edit", "grammar", "autocomplete"], "tsne_embedding": [-11.460317611694336, 0.40125951170921326]}, {"key": "bruch2009learning", "year": "2009", "title": "Learning from Examples to Improve Code Completion Systems", "abstract": "

The suggestions made by current IDE\u2019s code completion features are based exclusively on static type system of the programming language. As a result, often proposals are made which are irrelevant for a particular working context. Also, these suggestions are ordered alphabetically rather than by their relevance in a particular context. In this paper, we present intelligent code completion systems that learn from existing code repositories. We have implemented three such systems, each using the information contained in\nrepositories in a different way. We perform a large-scale quantitative evaluation of these systems, integrate the best performing one into Eclipse, and evaluate the latter also by a user study. Our experiments give evidence that intelligent code completion systems which learn from examples significantly outperform mainstream code completion systems in terms of the relevance of their suggestions and thus have the potential to enhance developers\u2019 productivity.

\n", "tags": ["autocomplete"], "tsne_embedding": [-9.823192596435547, -16.037429809570312]}, {"key": "buech2019learning", "year": "2019", "title": "Learning-based Recursive Aggregation of Abstract Syntax Trees for Code Clone Detection", "abstract": "

Code clone detection remains a crucial challenge in maintaining software projects. Many classic approaches rely on handcrafted aggregation schemes, while recent work uses supervised or unsupervised learning. In this work, we study several aspects of aggregation schemes for code clone detection based on supervised learning. To this aim, we implement an AST-based Recursive Neural Network. Firstly, our ablation study shows the influence of model choices and hyperparameters. We introduce error scaling as a way to effectively and efficiently address the class imbalance problem arising in code clone detection. Secondly, we study the influence of pretrained embeddings representing nodes in ASTs. We show that simply averaging all node vectors of a given AST yields strong baseline aggregation scheme. Further, learned AST aggregation schemes greatly benefit from pretrained node embeddings. Finally, we show the importance of carefully separating training and test data by clone clusters, to reliably measure generalization of models learned with supervision.

\n", "tags": ["grammar", "grammar", "clone"], "tsne_embedding": [1.5406274795532227, -8.192364692687988]}, {"key": "bui2018bilateral", "year": "2018", "title": "Bilateral Dependency Neural Networks for Cross-Language Algorithm Classification", "abstract": "

Algorithm classification is to automatically identify\nthe classes of a program based on the algorithm(s) and/or data\nstructure(s) implemented in the program. It can be useful for\nvarious tasks, such as code reuse, code theft detection, and malware detection. Code similarity metrics, on the basis of features\nextracted from syntax and semantics, have been used to classify\nprograms. Such features, however, often need manual selection\neffort and are specific to individual programming languages,\nlimiting the classifiers to programs in the same language.\nTo recognize the similarities and differences among algorithms\nimplemented in different languages, this paper describes a\nframework of Bilateral Neural Networks (Bi-NN) that builds a\nneural network on top of two underlying sub-networks, each of\nwhich encodes syntax and semantics of code in one language. A\nwhole Bi-NN can be trained with bilateral programs that implement the same algorithms and/or data structures in different\nlanguages and then be applied to recognize algorithm classes\nacross languages.

\n\n

We have instantiated the framework with several kinds of\ntoken-, tree- and graph-based neural networks that encode and\nlearn various kinds of information in code. We have applied\nthe instances of the framework to a code corpus collected from\nGitHub containing thousands of Java and C++ programs imple-\nmenting 50 different algorithms and data structures. Our evalua-\ntion results show that the use of Bi-NN indeed produces promising\nalgorithm classification results both within one language and\nacross languages, and the encoding of dependencies from code\ninto the underlying neural networks helps improve algorithm\nclassification accuracy further. In particular, our custom-built\ndependency trees with tree-based convolutional neural networks\nachieve the highest classification accuracy among the different\ninstances of the framework that we have evaluated. Our study\npoints to a possible future research direction to tailor bilateral\nand multilateral neural networks that encode more relevant\nsemantics for code learning, mining and analysis tasks

\n", "tags": ["representation"], "tsne_embedding": [-5.3919243812561035, 15.52010726928711]}, {"key": "bui2018cross", "year": "2018", "title": "Cross-Language Learning for Program Classification using Bilateral Tree-Based Convolutional Neural Networks", "abstract": "

Towards the vision of translating code that implements an algorithm from one programming language into another, this\npaper proposes an approach for automated program classification using\nbilateral tree-based convolutional neural networks\n(BiTBCNNs). It is layered on top of two tree-based\nconvolutional neural networks (TBCNNs), each of which recognizes the algorithm of code written in an individual programming language. The combination layer of the networks\nrecognizes the similarities and differences among code in different programming languages. The BiTBCNNs are trained\nusing the source code in different languages but known to\nimplement the same algorithms and/or functionalities. For\na preliminary evaluation, we use 3591 Java and 3534 C++\ncode snippets from 6 algorithms we crawled systematically\nfrom GitHub. We obtained over 90% accuracy in the cross-language binary classification task to tell whether any given\ntwo code snippets implement a same algorithm. Also, for the\nalgorithm classification task, i.e., to predict which one of the\nsix algorithm labels is implemented by an arbitrary C++ code\nsnippet, we achieved over 80% precision.

\n", "tags": ["representation", "grammar"], "tsne_embedding": [-5.510896682739258, 15.347472190856934]}, {"key": "bui2018hierarchical", "year": "2018", "title": "Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code", "abstract": "

Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may be either specific to the language grammars, or specific to certain kinds of code elements (e.g., tokens, phrases, API uses). This paper proposes a new approach to automatically learn cross-language representations for various kinds of structural code elements that may be used for program translation. Our key idea is two folded: First, we normalize and enrich code token streams with additional structural and semantic information, and train cross-language vector representations for the tokens (a.k.a. shared embeddings based on word2vec, a neural-network-based technique for producing word embeddings; Second, hierarchically from bottom up, we construct shared embeddings for code elements of higher levels of granularity (e.g., expressions, statements, methods) from the embeddings for their constituents, and then build mappings among code elements across languages based on similarities among embeddings. \nOur preliminary evaluations on about 40,000 Java and C# source files from 9 software projects show that our approach can automatically learn shared embeddings for various code elements in different languages and identify their cross-language mappings with reasonable Mean Average Precision scores. When compared with an existing tool for mapping library API methods, our approach identifies many more mappings accurately. The mapping results and code can be accessed at this https URL. We believe that our idea for learning cross-language vector representations with code structural information can be a useful step towards automated program translation.

\n", "tags": ["representation"], "tsne_embedding": [4.7911200523376465, -16.641225814819336]}, {"key": "bui2019learning", "year": "2019", "title": "SAR: Learning Cross-Language API Mappings with Little Knowledge", "abstract": "

To save manual effort, developers often translate programs from one programming language to another, instead of implementing it from scratch. Translating application program interfaces (APIs) used in one language to functionally equivalent ones available in another language is an important aspect of program translation. Existing approaches facilitate the translation by automatically identifying the API mappings across programming languages. However, all these approaches still require large amount of manual effort in preparing parallel program corpora, ranging from pairs of APIs, to manually identified code in different languages that are considered as functionally equivalent. To minimize the manual effort in identifying parallel program corpora and API mappings, this paper aims at an automated approach to map APIs across languages with much less knowledge a priori needed than other existing approaches. The approach is based on an realization of the notion of domain adaption combined with code embedding, which can better align two vector spaces: taking as input large sets of programs, our approach first generates numeric vector representations of the programs, especially the APIs used in each language, and it adapts generative adversarial networks (GAN) to align the vectors from the spaces of two languages. For a better alignment, we initialize the GAN with parameters derived from optional API mapping seeds that can be identified accurately with a simple automatic signature-based matching heuristic. Then the cross-language API mappings can be identified via nearest-neighbors queries in the aligned vector spaces.

\n", "tags": ["representation", "API"], "tsne_embedding": [4.915628910064697, -17.140378952026367]}, {"key": "bui2021efficient", "year": "2021", "title": "Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations", "abstract": "

We propose Corder, a self-supervised contrastive learning framework for source code model. Corder is designed to alleviate the need of labeled data for code retrieval and code summarization tasks. The pre-trained model of Corder can be used in two ways: (1) it can produce vector representation of code which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in a fine-tuning process for tasks that might still require label data such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set of semantic-preserving transformation operators to generate code snippets that are syntactically diverse but semantically equivalent. Through extensive experiments, we have shown that the code models pretrained by Corder substantially outperform the other baselines for code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.

\n", "tags": ["pretraining", "search"], "tsne_embedding": [-13.38041877746582, -9.8640775680542]}, {"key": "bui2021infercode", "year": "2021", "title": "InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees", "abstract": "

Building deep learning models on source code has found many successful software engineering applications, such as code search, code comment generation, bug detection, code migration, and so on. Current learning techniques, however, have a major drawback that these models are mostly trained on datasets labeled for particular downstream tasks, and code representations may not be suitable for other tasks. While some techniques produce representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. Although certain techniques generate representations from unlabeled code when applied to downstream tasks they are far from satisfactory. This paper proposes InferCode to overcome the limitation by adapting the self-supervised learning mechanism to build source code model. The key novelty lies in training code representations by predicting automatically identified subtrees from the context of the ASTs. Subtrees in ASTs are treated with InferCode as the labels for training code representations without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream tasks or code units. We trained an InferCode model instance using the Tree-based CNN as the encoder of a large set of Java code and applied it to downstream unsupervised tasks such as code clustering, code clone detection, cross-language code search or reused under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, ASTNN, higher performance results are achieved using our pre-trained InferCode model with a significant margin for most tasks including those involving different programming languages.

\n", "tags": ["representation"], "tsne_embedding": [-5.988785266876221, -1.168887972831726]}, {"key": "cai2020tag", "year": "2020", "title": "TAG : Type Auxiliary Guiding for Code Comment Generation", "abstract": "

Existing leading code comment generation approaches with the structure-to-sequence framework ignores the type information of the interpretation of the code, e.g., operator, string, etc. However, introducing the type information into the existing framework is non-trivial due to the hierarchical dependence among the type information. In order to address the issues above, we propose a Type Auxiliary Guiding encoder-decoder framework for the code comment generation task which considers the source code as an N-ary tree with type information associated with each node. Specifically, our framework is featured with a Type-associated Encoder and a Type-restricted Decoder which enables adaptive summarization of the source code. We further propose a hierarchical reinforcement learning method to resolve the training difficulties of our proposed framework. Extensive evaluations demonstrate the state-of-the-art performance of our framework with both the auto-evaluated metrics and case studies.

\n", "tags": ["bimodal", "documentation"], "tsne_embedding": [-14.917739868164062, -5.047293663024902]}, {"key": "cambronero2019deep", "year": "2019", "title": "When Deep Learning Met Code Search", "abstract": "

There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language queries, into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including unsupervised techniques, which rely only on a corpus of code examples, and supervised techniques, which use an aligned corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet.

\n\n

Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform, training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a minimal supervision extension to an existing unsupervised technique.

\n\n

Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective that more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus.

\n", "tags": ["search"], "tsne_embedding": [-1.925971269607544, -15.754229545593262]}, {"key": "campbell2014syntax", "year": "2014", "title": "Syntax Errors Just Aren\u2019t Natural: Improving Error Reporting with Language Models", "abstract": "

A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in\nmany errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance on mutated source code from several projects. Our tool, trained on the past versions of a project, can effectively augment the syntax error locations produced by the native compiler. Thus we provide a methodology and tool that exploits the naturalness of software source code to detect syntax errors alongside the parser.

\n", "tags": ["repair", "language model"], "tsne_embedding": [16.999675750732422, -5.0913848876953125]}, {"key": "casey2024survey", "year": "2024", "title": "A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks", "abstract": "

Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what\u2019s not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.

\n", "tags": ["survey", "cybersecurity", "vulnerability"], "tsne_embedding": [7.779167652130127, 16.420164108276367]}, {"key": "cassano2023can", "year": "2023", "title": "Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions", "abstract": "

A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing explanations of code. In contrast, the behavior of instructional code editing with LLMs is understudied. These are tasks in which the model is provided a block of code and an instruction to modify the code. The editing instruction may ask for a feature to be added or removed, describe a bug and ask for a fix, or ask for a different kind of solution. We introduce a carefully crafted benchmark of code editing tasks and use it to evaluate several cutting edge LLMs. Our evaluation exposes a significant gap between the capabilities of state-of-the-art open and closed models. For example, even GPT-3.5-Turbo is better than the best open model at code editing tasks. We also introduce a new, carefully curated, permissively licensed training dataset of code editing tasks coupled with natural language instructions. Using this training dataset, we show that we can fine-tune open Code LLMs to significantly improve their code editing capabilities, closing the gap between open and closed models. All code, data, and models are available at https://github.com/nuprl/CanItEdit.

\n", "tags": ["editing"], "tsne_embedding": [-0.9450193047523499, 1.4730077981948853]}, {"key": "cerulo2013hidden", "year": "2013", "title": "A Hidden Markov Model to Detect Coded Information Islands in Free Text", "abstract": "

Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such a content is challenging, due to the mix-up of\nsource code and natural language, unstructured text.

\n\n

In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train a HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens \u2014 e.g., words, language keywords, numbers, parentheses, punctuation marks, etc. \u2014 observed in a text switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could be easily extended to include in principle any text-interleaved language.

\n\n

We evaluated our approach with respect to the state of art on a set of development emails and bug reports drawn from the software repositories of well known open source systems. Results indicate an accuracy between 82% and 99%, which is in line with existing approaches which, differently from ours, require the manual definition of regular expressions or parsers.

\n\n", "tags": ["information extraction"], "tsne_embedding": [-4.502639293670654, -26.344722747802734]}, {"key": "cerulo2015irish", "year": "2015", "title": "Irish: A Hidden Markov Model to detect coded information islands in free text", "abstract": "

Developers\u2019 communication, as contained in emails, issue trackers, and forums, is a precious source of information to support the development process. For example, it can\nbe used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers\u2019 communication can be useful to support\nseveral software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging, due to the unstructured nature of free text, which mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language parts.

\n\n

We conduct an extensive evaluation of Irish (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, with respect to the state of art approaches based on island parsing or island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure of Irish between 74% and 99%; this is in line with existing approaches which, differently from Irish, require specific expertise for the definition of regular expressions or grammars.

\n\n", "tags": ["information extraction"], "tsne_embedding": [-4.497921466827393, -26.341266632080078]}, {"key": "chae2016automatically", "year": "2016", "title": "Automatically generating features for learning program analysis heuristics", "abstract": "

We present a technique for automatically generating features for data-driven program analyses. Recently data-driven approaches for building a program analysis have been proposed, which mine existing codebases and automatically learn heuristics for finding a cost-effective abstraction for a given analysis task. Such approaches reduce the burden of the analysis designers, but they do not remove it completely; they still leave the highly nontrivial task of designing so called features to the hands of the designers. Our technique automates this feature design process. The idea is to use programs as features after reducing and abstracting them. Our technique goes through selected program-query pairs in codebases, and it reduces and abstracts the program in each pair to a few lines of code, while ensuring that the analysis behaves similarly for the original and the new programs with respect to the query. Each reduced program serves as a boolean feature for program-query pairs. This feature evaluates to true for a given program-query pair when (as a program) it is included in the program part of the pair. We have implemented our approach for three real-world program analyses. Our experimental evaluation shows that these analyses with automatically-generated features perform comparably to those with manually crafted features.

\n", "tags": ["representation"], "tsne_embedding": [20.91889762878418, 12.054704666137695]}, {"key": "chakraborty2018tree2tree", "year": "2018", "title": "CODIT: Code Editing with Tree-Based Neural Machine Translation", "abstract": "

The way developers edit day-to-day code tends to be repetitive, often using existing code elements. Many researchers have tried to automate repetitive code changes by learning from specific change templates which are applied to limited scope. The advancement of Neural Machine Translation (NMT) and the availability of vast open-source evolutionary data opens up the possibility of automatically learning those templates from the wild. However, unlike natural languages, for which NMT techniques were originally devised, source code and its changes have certain properties. For instance, compared to natural language, source code vocabulary can be significantly larger. Further, good changes in code do not break its syntactic structure. Thus, deploying state-of-the-art NMT models without adapting the methods to the source code domain yields sub-optimal results. To this end, we propose a novel Tree based NMT system to model source code changes and learn code change patterns from the wild. We realize our model with a change suggestion engine: CODIT and train the model with more than 30k real-world changes and evaluate it on 6k patches. Our evaluation shows the effectiveness of CODIT in learning and suggesting patches.CODIT also shows promise generating bug fix patches.

\n", "tags": ["grammar", "grammar", "repair", "code generation"], "tsne_embedding": [-13.869034767150879, 2.5546603202819824]}, {"key": "chakraborty2020deep", "year": "2021", "title": "Deep Learning based Vulnerability Detection: Are We There Yet?", "abstract": "

Automated detection of software vulnerabilities is a fundamental problem in software security. Existing program analysis techniques either suffer from high false positives or false negatives. Recent progress in Deep Learning (DL) has resulted in a surge of interest in applying DL for automated vulnerability detection. Several recent studies have demonstrated promising results achieving an accuracy of up to 95% at detecting vulnerabilities. In this paper, we ask, \u201chow well do the state-of-the-art DL-based techniques perform in a real-world vulnerability prediction scenario?\u201d. To our surprise, we find that their performance drops by more than 50%. A systematic investigation of what causes such precipitous performance drop reveals that existing DL-based vulnerability prediction approaches suffer from challenges with the training data (e.g., data duplication, unrealistic distribution of vulnerable classes, etc.) and with the model choices (e.g., simple token-based models). As a result, these approaches often do not learn features related to the actual cause of the vulnerabilities. Instead, they learn unrelated artifacts from the dataset (e.g., specific variable/function names, etc.). Leveraging these empirical findings, we demonstrate how a more principled approach to data collection and model design, based on realistic settings of vulnerability prediction, can lead to better solutions. The resulting tools perform significantly better than the studied baseline: up to 33.57% boost in precision and 128.38% boost in recall compared to the best performing model in the literature. Overall, this paper elucidates existing DL-based vulnerability prediction systems\u2019 potential issues and draws a roadmap for future DL-based vulnerability prediction research. In that spirit, we make available all the artifacts supporting our results: https://git.io/Jf6IA

\n", "tags": ["defect", "survey"], "tsne_embedding": [8.100634574890137, 19.442554473876953]}, {"key": "chakraborty2021multimodal", "year": "2021", "title": "On Multi-Modal Learning of Editing Source Code", "abstract": "

In recent years, Neural Machine Translator (NMT) has shown promise in automatically editing source code. Typical NMT based code editor only considers the code that needs to be changed as input and suggests developers with a ranked list of patched code to choose from - where the correct one may not always be at the top of the list. While NMT based code editing systems generate a broad spectrum of plausible patches, the correct one depends on the developers\u2019 requirement and often on the context where the patch is applied. Thus, if developers provide some hints, using natural language, or providing patch context, NMT models can benefit from them. As a proof of concept, in this research, we leverage three modalities of information: edit location, edit code context, commit messages (as a proxy of developers\u2019 hint in natural language) to automatically generate edits with NMT models. To that end, we build MODIT, a multi-modal NMT based code editing engine. With in-depth investigation and analysis, we show that developers\u2019 hint as an input modality can narrow the search space for patches and outperform state-of-the-art models to generate correctly patched code in top-1 position.

\n", "tags": ["Transformer", "edit"], "tsne_embedding": [-12.831646919250488, 2.364220142364502]}, {"key": "chen2019capturing", "year": "2019", "title": "Capturing source code semantics via tree-based convolution over API-enhanced AST", "abstract": "

When deep learning meets big code, a key question is how to efficiently learn a distributed representation for source code that can capture its semantics effectively. We propose to use tree-based convolution over API-enhanced AST. To demonstrate the effectiveness of our approach, we apply it to detect semantic clones\u2014code fragments with similar semantics but dissimilar syntax. Experiment results show that our approach outperforms an existing state-of-the-art approach that uses tree-based LSTM, with an increase of 0.39 and 0.12 in F1-score on OJClone and BigCloneBench respectively. We further propose architectures that incorporate our approach for code search and code summarization.

\n", "tags": ["grammar", "representation"], "tsne_embedding": [0.42834004759788513, -7.432295322418213]}, {"key": "chen2019literature", "year": "2019", "title": "A Literature Study of Embeddings on Source Code", "abstract": "

Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and discuss the usage of word embedding techniques on programs and source code. The articles in this survey have been collected by asking authors of related work and with an extensive search on Google Scholar. Each article is categorized into five categories: 1. embedding of tokens 2. embedding of functions or methods 3. embedding of sequences or sets of method calls 4. embedding of binary code 5. other embeddings. We also provide links to experimental data and show some remarkable visualization of code embeddings. In summary, word embedding has been successfully applied on different granularities of source code. With access to countless open-source repositories, we see a great potential of applying other data-driven natural language processing techniques on source code in the future.

\n", "tags": ["representation"], "tsne_embedding": [4.397859573364258, -14.812005043029785]}, {"key": "chen2019mining", "year": "2019", "title": "Mining Likely Analogical APIs across Third-Party Libraries via Large-Scale Unsupervised API Semantics Embedding", "abstract": "

Establishing API mappings between third-party libraries is a prerequisite step for library migration tasks. Manually establishing API mappings is tedious due to the large number of APIs to be examined. Having an automatic technique to create a database of likely API mappings can significantly ease the task. Unfortunately, existing techniques either adopt supervised learning mechanism that requires already-ported or functionality similar applications across major programming languages or platforms, which are difficult to come by for an arbitrary pair of third-party libraries, or cannot deal with lexical gap in the API descriptions of different libraries. To overcome these limitations, we present an unsupervised deep learning based approach to embed both API usage semantics and API description (name and document) semantics into vector space for inferring likely analogical API mappings between libraries. Based on deep learning models trained using tens of millions of API call sequences, method names and comments of 2.8 millions of methods from 135,127 GitHub projects, our approach significantly outperforms other deep learning or traditional information retrieval (IR) methods for inferring likely analogical APIs. We implement a proof-of-concept website which can recommend analogical APIs for 583,501 APIs of 111 pairs of analogical Java libraries with diverse functionalities. This scale of third-party analogical-API database has never been achieved before.

\n", "tags": ["API", "representation"], "tsne_embedding": [6.466467380523682, -16.57985496520996]}, {"key": "chen2019sequencer", "year": "2019", "title": "SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair", "abstract": "

This paper presents a novel end-to-end approach to program repair based on sequence-to-sequence learning. We devise, implement, and evaluate a system, called SequenceR, for fixing bugs based on sequence-to-sequence learning on source code. This approach uses the copy mechanism to overcome the unlimited vocabulary problem that occurs with big code. Our system is data-driven; we train it on 35,578 commits, carefully curated from open-source repositories. We evaluate it on 4,711 independent real bug fixes, as well on the Defects4J benchmark used in program repair research. SequenceR is able to perfectly predict the fixed line for 950/4711 testing samples. It captures a wide range of repair operators without any domain-specific top-down design.

\n", "tags": ["repair", "code generation"], "tsne_embedding": [20.54841423034668, 1.8379859924316406]}, {"key": "chen2021evaluating", "year": "2021", "title": "Evaluating Large Language Models Trained on Code", "abstract": "

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

\n", "tags": ["language model", "synthesis"], "tsne_embedding": [2.027163505554199, 2.645599842071533]}, {"key": "chen2021plur", "year": "2021", "title": "PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair", "abstract": "

Machine learning for understanding and editing source code has recently attracted significant interest, with many developments in new models, new code representations, and new tasks.This proliferation can appear disparate and disconnected, making each approach seemingly unique and incompatible, thus obscuring the core machine learning challenges and contributions.In this work, we demonstrate that the landscape can be significantly simplified by taking a general approach of mapping a graph to a sequence of tokens and pointers.Our main result is to show that 16 recently published tasks of different shapes can be cast in this form, based on which a single model architecture achieves near or above state-of-the-art results on nearly all tasks, outperforming custom models like code2seq and alternative generic models like Transformers.This unification further enables multi-task learning and a series of cross-cutting experiments about the importance of different modeling choices for code understanding and repair tasks.The full framework, called PLUR, is easily extensible to more tasks, and will be open-sourced (https://github.com/google-research/plur).

\n", "tags": ["repair"], "tsne_embedding": [-1.4295850992202759, 10.289177894592285]}, {"key": "chen2022codet", "year": "2022", "title": "CodeT: Code Generation with Generated Tests", "abstract": "

Given a programming problem, pre-trained language models such as Codex have demonstrated the ability to generate multiple different code solutions via sampling. However, selecting a correct or best solution from those samples still remains a challenge. While an easy way to verify the correctness of a code solution is through executing test cases, producing high-quality test cases is prohibitively expensive. In this paper, we explore the use of pre-trained language models to automatically generate test cases, calling our method CodeT: Code generation with generated Tests. CodeT executes the code solutions using the generated test cases, and then chooses the best solution based on a dual execution agreement with both the generated test cases and other generated solutions. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks. Extensive experimental results demonstrate CodeT can achieve significant, consistent, and surprising improvements over previous methods. For example, CodeT improves the pass@1 on HumanEval to 65.8%, an increase of absolute 18.8% on the code-davinci-002 model, and an absolute 20+% improvement over previous state-of-the-art results.

\n", "tags": ["synthesis", "Transformer", "execution"], "tsne_embedding": [6.871870517730713, -0.0501372367143631]}, {"key": "chen2022learning.md", "year": "2022", "title": "Learning to Reverse DNNs from AI Programs Automatically", "abstract": "

With the privatization deployment of DNNs on edge devices, the security of on-device DNNs has raised significant concern. To quantify the model leakage risk of on-device DNNs automatically, we propose NNReverse, the first learning-based method which can reverse DNNs from AI programs without domain knowledge. NNReverse trains a representation model to represent the semantics of binary code for DNN layers. By searching the most similar function in our database, NNReverse infers the layer type of a given function\u2019s binary code. To represent assembly instructions semantics precisely, NNReverse proposes a more finegrained embedding model to represent the textual and structural-semantic of assembly functions.

\n", "tags": ["Reverse Engineering", "Binary Code"], "tsne_embedding": [11.872802734375, 17.202865600585938]}, {"key": "chen2023diversevul", "year": "2023", "title": "DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection", "abstract": "

We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits. Our dataset covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection.\nCombining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models.\nHowever, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future for vulnerability detection, outperforming Graph Neural Networks (GNNs) with manual feature engineering. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.

\n", "tags": ["dataset", "Transformer", "vulnerability"], "tsne_embedding": [8.333921432495117, 18.840097427368164]}, {"key": "chen2023supersonic", "year": "2023", "title": "Supersonic: Learning to Generate Source Code Optimizations in C/C++", "abstract": "

Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic\u2019s performance is benchmarked against OpenAI\u2019s GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.

\n", "tags": ["optimization"], "tsne_embedding": [5.684123992919922, 10.923129081726074]}, {"key": "chen2024ppm.md", "year": "2024", "title": "PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models", "abstract": "

In recent times, a plethora of Large Code Generation Models (LCGMs) have been proposed, showcasing significant potential in assisting developers with complex programming tasks. Benchmarking LCGMs necessitates the creation of a set of diverse programming problems, and each problem comprises the prompt (including the task description), canonical solution, and test inputs. The existing methods for constructing such a problem set can be categorized into two main types: manual methods and perturbation-based methods. However, manual methods demand high effort and lack scalability, while also risking data integrity due to LCGMs\u2019 potentially contaminated data collection, and perturbation-based approaches mainly generate semantically homogeneous problems with the same canonical solutions and introduce typos that can be easily auto-corrected by IDE, making them ineffective and unrealistic. In this work, we propose the idea of programming problem merging (PPM) and provide two implementation of this idea, we utilize our tool on two widely-used datasets and compare it against nine baseline methods using eight code generation models. The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems, comparing to the baselines.

\n", "tags": ["benchmarking", "evaluation"], "tsne_embedding": [5.269049644470215, 0.7609741687774658]}, {"key": "chibotaru2019scalable", "year": "2019", "title": "Scalable Taint Specification Inference with Big Code", "abstract": "

We present a new scalable, semi-supervised method for inferring\ntaint analysis specifications by learning from a large dataset of programs.\nTaint specifications capture the role of library APIs (source, sink, sanitizer)\nand are a critical ingredient of any taint analyzer that aims to detect\nsecurity violations based on information flow.

\n\n

The core idea of our method\nis to formulate the taint specification learning problem as a linear\noptimization task over a large set of information flow constraints.\nThe resulting constraint system can then be efficiently solved with\nstate-of-the-art solvers. Thanks to its scalability, our method can infer\nmany new and interesting taint specifications by simultaneously learning from\na large dataset of programs (e.g., as found on GitHub), while requiring \nfew manual annotations.

\n\n

We implemented our method in an end-to-end system,\ncalled Seldon, targeting Python, a language where static specification\ninference is particularly hard due to lack of typing information.\nWe show that Seldon is practically effective: it learned almost 7,000 API\nroles from over 210,000 candidate APIs with very little supervision\n(less than 300 annotations) and with high estimated precision (67%).\nFurther,using the learned specifications, our taint analyzer flagged more than\n20,000 violations in open source projects, 97% of which were\nundetectable without the inferred specifications.

\n", "tags": ["defect", "program analysis"], "tsne_embedding": [14.959173202514648, 11.574461936950684]}, {"key": "chirkova2020empirical", "year": "2020", "title": "Empirical Study of Transformers for Source Code", "abstract": "

Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i. e. follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and all consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model.

\n", "tags": ["Transformer"], "tsne_embedding": [-14.117578506469727, -15.053474426269531]}, {"key": "chirkova2021embeddings", "year": "2021", "title": "On the Embeddings of Variables in Recurrent Neural Networks for Source Code", "abstract": "

Source code processing heavily relies on the methods widely used in natural language processing (NLP), but involves specifics that need to be taken into account to achieve higher quality. An example of this specificity is that the semantics of a variable is defined not only by its name but also by the contexts in which the variable occurs. In this work, we develop dynamic embeddings, a recurrent mechanism that adjusts the learned semantics of the variable when it obtains more information about the variable\u2019s role in the program. We show that using the proposed dynamic embeddings significantly improves the performance of the recurrent neural network, in code completion and bug fixing tasks.

\n", "tags": ["autocomplete"], "tsne_embedding": [-11.847761154174805, 6.743480682373047]}, {"key": "chow2023beware", "year": "2023", "title": "Beware of the Unexpected: Bimodal Taint Analysis", "abstract": "

Static analysis is a powerful tool for detecting security vulnerabilities and other programming problems. Global taint tracking, in particular, can spot vulnerabilities arising from complicated data flow across multiple functions. However, precisely identifying which flows are problematic is challenging, and sometimes depends on factors beyond the reach of pure program analysis, such as conventions and informal knowledge. For example, learning that a parameter name of an API function locale ends up in a file path is surprising and potentially problematic. In contrast, it would be completely unsurprising to find that a parameter command passed to an API function execaCommand is eventually interpreted as part of an operating-system command. This paper presents Fluffy, a bimodal taint analysis that combines static analysis, which reasons about data flow, with machine learning, which probabilistically determines which flows are potentially problematic. The key idea is to let machine learning models predict from natural language information involved in a taint flow, such as API names, whether the flow is expected or unexpected, and to inform developers only about the latter. We present a general framework and instantiate it with four learned models, which offer different trade-offs between the need to annotate training data and the accuracy of predictions. We implement Fluffy on top of the CodeQL analysis framework and apply it to 250K JavaScript projects. Evaluating on five common vulnerability types, we find that Fluffy achieves an F1 score of 0.85 or more on four of them across a variety of datasets.

\n", "tags": ["static analysis"], "tsne_embedding": [14.70133113861084, 11.389726638793945]}, {"key": "ciurumelea2020suggesting", "year": "2020", "title": "Suggesting Comment Completions for Python using Neural Language Models", "abstract": "

Source-code comments are an important communication medium between developers to better understand and maintain software. Current research focuses on auto-generating comments by summarizing the code. However, good comments contain additional details, like important design decisions or required trade-offs, and only developers can decide on the proper comment content. Automated summarization techniques cannot include information that does not exist in the code, therefore fully-automated approaches while helpful, will be of limited use. In our work, we propose to empower developers through a semi-automated system instead. We investigate the feasibility of using neural language models trained on a large corpus of Python documentation strings to generate completion suggestions and obtain promising results. By focusing on confident predictions, we can obtain a top-3 accuracy of over 70%, although this comes at the cost of lower suggestion frequency. Our models can be improved by leveraging context information like the signature and the full body of the method. Additionally, we are able to return good accuracy completions even for new projects, suggesting the generalizability of our approach.

\n", "tags": ["bimodal", "autocomplete", "documentation"], "tsne_embedding": [-13.65316390991211, -3.5353806018829346]}, {"key": "clement2020pymt5", "year": "2020", "title": "PyMT5: multi-mode translation of natural language and Python code with transformers", "abstract": "

Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation.

\n", "tags": ["bimodal", "code generation", "summarization", "documentation", "language model", "pretraining"], "tsne_embedding": [1.4458861351013184, -24.541547775268555]}, {"key": "clement2021distilling", "year": "2021", "title": "Distilling Transformers for Neural Cross-Domain Search", "abstract": "

Pre-trained transformers have recently clinched top spots in the gamut of natural language tasks and pioneered solutions to software engineering tasks. Even information retrieval has not been immune to the charm of the transformer, though their large size and cost is generally a barrier to deployment. While there has been much work in streamlining, caching, and modifying transformer architectures for production, here we explore a new direction: distilling a large pre-trained translation model into a lightweight bi-encoder which can be efficiently cached and queried. We argue from a probabilistic perspective that sequence-to-sequence models are a conceptually ideal\u2014albeit highly impractical\u2014retriever. We derive a new distillation objective, implementing it as a data augmentation scheme. Using natural language source code search as a case study for cross-domain search, we demonstrate the validity of this idea by significantly improving upon the current leader of the CodeSearchNet challenge, a recent natural language code search benchmark.

\n", "tags": ["search", "Transformer"], "tsne_embedding": [-3.162325382232666, -7.084317207336426]}, {"key": "clement2021long", "year": "2021", "title": "Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy", "abstract": "

Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they will be unable to leverage the entire relevant context of large files and packages for any given task. While there are many efforts to extend the context window, we introduce an architecture-independent approach for leveraging the syntactic hierarchies of source code for incorporating entire file-level context into a fixed-length window. Using concrete syntax trees of each source file we extract syntactic hierarchies and integrate them into context window by selectively removing from view more specific, less relevant scopes for a given task. We evaluate this approach on code generation tasks and joint translation of natural language and source code in Python programming language, achieving a new state-of-the-art in code completion and summarization for Python in the CodeXGLUE benchmark. We also introduce new CodeXGLUE benchmarks for user-experience-motivated tasks: code completion with normalized literals, method body completion/code summarization conditioned on file-level context.

\n", "tags": ["Transformer", "language model", "code generation"], "tsne_embedding": [0.5202171802520752, 0.43300482630729675]}, {"key": "commit2vec2019lozoya", "year": "2019", "title": "Commit2Vec: Learning Distributed Representations of Code Changes", "abstract": "

Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available source code (e.g., from open-source software repositories).

\n\n

In this work, we elaborate upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and we adapt it to represent source changes (i.e., commits). We use this representation to classify security-relevant commits.

\n\n

Because our method uses transfer learning (that is, we train a network on a \u201cpretext task\u201d for which abundant labeled data is available, and then we use such network for the target task of commit classification, for which fewer labeled instances are available), we studied the impact of pre-training the network using two different pretext tasks versus a randomly initialized model.

\n\n

Our results indicate that representations that leverage the structural information obtained through code syntax outperform token-based representations. Furthermore, the performance metrics obtained when pre-training on a loosely related pretext task with a very large dataset (>10e6 samples) were surpassed when pretraining on a smaller dataset (>10e4 samples) but for a pretext task that is more closely related to the target task.

\n", "tags": ["edit"], "tsne_embedding": [-7.0181779861450195, 0.03357180580496788]}, {"key": "compton2020embedding", "year": "2020", "title": "Embedding Java Classes with code2vec: Improvements from Variable Obfuscation", "abstract": "

Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaining the semantics of the code as much as possible. code2vec is a recently released embedding approach that uses the proxy task of method name prediction to map Java methods to feature vectors. However, experimentation with code2vec shows that it learns to rely on variable names for prediction, causing it to be easily fooled by typos or adversarial attacks. Moreover, it is only able to embed individual Java methods and cannot embed an entire collection of methods such as those present in a typical Java class, making it difficult to perform predictions at the class level (e.g., for the identification of malicious Java classes). Both shortcomings are addressed in the research presented in this paper. We investigate the effect of obfuscating variable names during the training of a code2vec model to force it to rely on the structure of the code rather than specific names and consider a simple approach to creating class-level embeddings by aggregating sets of method embeddings. Our results, obtained on a challenging new collection of source-code classification problems, indicate that obfuscating variable names produces an embedding model that is both impervious to variable naming and more accurately reflects code semantics. The datasets, models, and code are shared for further ML research on source code.

\n", "tags": ["naming", "adversarial"], "tsne_embedding": [4.172283172607422, -11.78293514251709]}, {"key": "corley2015exploring", "year": "2015", "title": "Exploring the Use of Deep Learning for Feature Location", "abstract": "

Deep learning models are a class of neural networks. Relative to n-gram models, deep learning models can capture more complex statistical patterns based on smaller training corpora. In this paper we explore the use of a particular deep learning model, document vectors (DVs), for feature location. DVs seem well suited to use with source code, because they both capture the influence of context on each term in a corpus and map terms into a continuous semantic space that encodes semantic relationships such as synonymy. We present preliminary results that show that a feature location technique (FLT) based on DVs can outperform an analogous FLT based on latent Dirichlet allocation (LDA) and then suggest several directions for future work on the use of deep learning models to improve developer effectiveness in feature location.

\n", "tags": ["feature location", "representation"], "tsne_embedding": [-8.10754680633545, -22.864864349365234]}, {"key": "cummins2017end", "year": "2017", "title": "End-to-end Deep Learning of Optimization Heuristics", "abstract": "

Accurate automatic optimization heuristics are necessary for dealing with the complexity and diversity of modern hardware and software. Machine learning is a proven technique for learning such heuristics, but its success is bound by the quality of the features used. These features must be hand crafted by developers through a combination of expert domain knowledge and trial and error. This makes the quality of the final model directly dependent on the skill and available time of the system architect.

\n\n

Our work introduces a better way for building heuristics. We develop a deep neural network that learns heuristics over raw code, entirely without using code features. The neural network simultaneously constructs appropriate representations of the code and learns how best to optimize, removing the need for manual feature creation. Further, we show that our neural nets can transfer learning from one optimization problem to another, improving the accuracy of new models, without the help of human experts.

\n\n

We compare the effectiveness of our automatically generated heuristics against ones with features hand-picked by experts. We examine two challenging tasks: predicting optimal mapping for heterogeneous parallelism and GPU thread coarsening factors. In 89% of the cases, the quality of our fully automatic heuristics matches or surpasses that of state-of-the-art predictive models using hand-crafted features, providing on average 14% and 12% more performance with no human effort expended on designing features.

\n", "tags": ["optimization"], "tsne_embedding": [6.1369242668151855, 12.635452270507812]}, {"key": "cummins2017synthesizing", "year": "2017", "title": "Synthesizing benchmarks for predictive modeling", "abstract": "

Predictive modeling using machine learning is an effective method for building compiler heuristics, but there is a shortage of benchmarks. Typical machine learning experiments outside of the compilation field train over thousands or millions of examples. In machine learning for compilers, however, there are typically only a few dozen common benchmarks available. This limits the quality of learned models, as they have very sparse training data for what are often high-dimensional feature spaces. What is needed is a way to generate an unbounded number of training programs that finely cover the feature space. At the same time the generated programs must be similar to the types of programs that human developers actually write, otherwise the learning will target the wrong parts of the feature space. We mine open source repositories for program fragments and apply deep learning techniques to automatically construct models for how humans write programs. We sample these models to generate an unbounded number of runnable training programs. The quality of the programs is such that even human developers struggle to distinguish our generated programs from hand-written code. We use our generator for OpenCL programs, CLgen, to automatically synthesize thousands of programs and show that learning over these improves the performance of a state of the art predictive model by 1.27x. In addition, the fine covering of the feature space automatically exposes weaknesses in the feature design which are invisible with the sparse training examples from existing benchmark suites. Correcting these weaknesses further increases performance by 4.30x.

\n", "tags": ["optimization", "code generation"], "tsne_embedding": [6.981786251068115, 9.773744583129883]}, {"key": "cummins2018compiler", "year": "2018", "title": "Compiler Fuzzing through Deep Learning", "abstract": "

Random program generation \u2014 fuzzing \u2014 is an effective technique\nfor discovering bugs in compilers but successful fuzzers require\nextensive development effort for every language supported by the\ncompiler, and often leave parts of the language space untested.

\n\n

We introduce DeepSmith, a novel machine learning approach\nto accelerating compiler validation through the inference of generative models for compiler inputs. Our approach\ninfers a learned\nmodel of the structure of real world code based on a large corpus of open source code. Then, it uses the model to automatically\ngenerate tens of thousands of realistic programs. Finally, we apply\nestablished differential testing methodologies on them to expose\nbugs in compilers. We apply our approach to the OpenCL programming language, automatically exposing bugs with little effort on our\nside. In 1,000 hours of automated testing of commercial and open\nsource compilers, we discover bugs in all of them, submitting 67\nbug reports. Our test cases are on average two orders of magnitude\nsmaller than the state-of-the-art, require 3.03\u00d7 less time to generate\nand evaluate, and expose bugs which the state-of-the-art cannot.\nOur random program generator, comprising only 500 lines of code,\ntook 12 hours to train for OpenCL versus the state-of-the-art taking\n9 man months to port from a generator for C and 50,000 lines of\ncode. With 18 lines of code we extended our program generator to\na second language, uncovering crashes in Solidity compilers in 12\nhours of automated testing.

\n", "tags": ["fuzzing", "code generation"], "tsne_embedding": [17.10828399658203, 12.16179084777832]}, {"key": "cummins2020programl", "year": "2020", "title": "ProGraML: Graph-based Deep Learning for Program Optimization and Analysis", "abstract": "

The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in two key areas: a representation that accurately captures the semantics of programs, and a model architecture with sufficient expressiveness to reason about this representation.

\n\n

We introduce ProGraML - Program Graphs for Machine Learning - a novel graph-based program representation using a low level, language agnostic, and portable format; and machine learning models capable of performing complex downstream tasks over these graphs. The ProGraML representation is a directed attributed multigraph that captures control, data, and call relations, and summarizes instruction and operand types and ordering. Message Passing Neural Networks propagate information through this structured representation, enabling whole-program or per-vertex classification tasks.

\n\n

ProGraML provides a general-purpose program representation that equips learnable models to perform the types of program analysis that are fundamental to optimization. To this end, we evaluate the performance of our approach first on a suite of traditional compiler analysis tasks: control flow reachability, dominator trees, data dependencies, variable liveness, and common subexpression detection. On a benchmark dataset of 250k LLVM-IR files covering six source programming languages, ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches. We then apply our approach to two high-level tasks - heterogeneous device mapping and program classification - setting new state-of-the-art performance in both.

\n", "tags": ["dataset", "GNN"], "tsne_embedding": [1.2377556562423706, 14.816255569458008]}, {"key": "cvitkovic2018open", "year": "2018", "title": "Open Vocabulary Learning on Source Code with a Graph-Structured Cache", "abstract": "

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. However, a major challenge is that code is written using an open, rapidly changing vocabulary due to, e.g., the coinage of new variable and method names. Reasoning over such a vocabulary is not something for which most NLP methods are designed. We introduce a Graph-Structured Cache to address this problem; this cache contains a node for each new word the model encounters with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models\u2019 performance on a code completion task and a variable naming task \u2014 with over 100% relative improvement on the latter \u2014 at the cost of a moderate increase in computation time.

\n", "tags": ["GNN", "variable misuse", "defect", "representation"], "tsne_embedding": [-3.7394845485687256, 8.855265617370605]}, {"key": "dam2016deep", "year": "2016", "title": "A deep language model for software code", "abstract": "

Existing language models such as n-grams for software code often fail to capture a long context where dependent code elements scatter far apart. In this paper, we propose a novel approach to build a language model for software code to address this particular issue. Our language model, partly inspired by human memory, is built upon the powerful deep learning-based Long Short Term Memory architecture that is capable of learning long-term dependencies which occur frequently in software code. Results from our intrinsic evaluation on a corpus of Java projects have demonstrated the effectiveness of our language model. This work contributes to realizing our vision for DeepSoft, an end-to-end, generic deep learning-based framework for modeling software and its development process.

\n", "tags": ["language model", "code generation"], "tsne_embedding": [-3.0759482383728027, 5.178798675537109]}, {"key": "dash2018refinym", "year": "2018", "title": "RefiNym: Using Names to Refine Types", "abstract": "

Source code is bimodal: it combines a formal algorithmic channel and a natural language channel of identifiers and comments. In this work, we model the bimodality of code with name lows, an assignment low graph augmented to track identiier names. Conceptual types are logically distinct types that do not always coincide with program types. Passwords and URLs are example conceptual types that can share the program type string. Our tool, RefiNym, is an unsupervised method that mines a lattice of conceptual types from name lows and reiies them into distinct nominal types. For string, RefiNym inds and splits conceptual types originally merged into a single type, reducing the number of same-type variables per scope from 8.7 to 2.2 while eliminating 21.9% of scopes that have more than one same-type variable in scope. This makes the code more self-documenting and frees the type system to prevent a developer from inadvertently assigning data across conceptual types.

\n", "tags": ["program analysis", "types"], "tsne_embedding": [14.625188827514648, -15.901102066040039]}, {"key": "david2019neural", "year": "2019", "title": "Neural Reverse Engineering of Stripped Binaries", "abstract": "

We address the problem of predicting procedure names in stripped executables which contain no debug information.\nPredicting procedure names can dramatically ease the task of reverse engineering, saving precious time and human effort. \nWe present a novel approach that leverages static analysis of binaries with encoder-decoder-based neural networks.\nThe main idea is to use static analysis to obtain enriched representations of API call sites; encode a set of sequences\nof these call sites; and finally, attend to the encoded sequences while decoding the target name token-by-token. \nWe evaluate our model by predicting procedure names over 60,000 procedures in 10,000 stripped executables.\nOur model achieves 81.70 precision and 80.12 recall in predicting procedure names within GNU packages, and 55.48\nprecision and 51.31 recall in a diverse, cross-package, dataset. Comparing to previous approaches,\nthe predictions made by our model are much more accurate and informative.

\n", "tags": ["naming", "deobfuscation", "GNN"], "tsne_embedding": [13.694778442382812, 16.708425521850586]}, {"key": "defreez2018path", "year": "2018", "title": "Path-Based Function Embedding and its Application to Specification Mining", "abstract": "

Identifying the relationships among program elements is useful\nfor program understanding, debugging, and analysis. One such\nrelationship is synonymy. Function synonyms are functions that\nplay a similar role in code, e.g. functions that perform initialization\nfor different device drivers, or functions that implement different\nsymmetric-key encryption schemes. Function synonyms are not\nnecessarily semantically equivalent and can be syntactically dissimilar; consequently, approaches for identifying code clones or\nfunctional equivalence cannot be used to identify them. This paper presents func2vec, an algorithm that maps each function to a vector in a vector space such that function synonyms are grouped\ntogether. We compute the function embedding by training a neu-\nral network on sentences generated from random walks over an\nencoding of the program as a labeled pushdown system (\u2113-PDS).\nWe demonstrate that func2vec\nis effective at identifying function\nsynonyms in the Linux kernel. Furthermore, we show how function\nsynonyms enable mining error-handling specifications with high\nsupport in Linux file systems and drivers.

\n", "tags": ["program analysis", "representation"], "tsne_embedding": [7.871365547180176, -13.12382698059082]}, {"key": "derezendemartins2020concra.md", "year": "2020", "title": "CoNCRA: A Convolutional Neural Network Code Retrieval Approach", "abstract": "

Software developers routinely search for code using general-purpose search engines. However, these search engines cannot find code semantically unless it has an accompanying description. We propose a technique for semantic code search: A Convolutional Neural Network approach to code retrieval (CoNCRA). Our technique aims to find the code snippet that most closely matches the developer\u2019s intent, expressed in natural language. We evaluated our approach\u2019s efficacy on a dataset composed of questions and code snippets collected from Stack Overflow. Our preliminary results showed that our technique, which prioritizes local interactions (words nearby), improved the state-of-the-art (SOTA) by 5% on average, retrieving the most relevant code snippets in the top 3 (three) positions by almost 80% of the time. Therefore, our technique is promising and can improve the efficacy of semantic code retrieval.

\n\n", "tags": ["search"], "tsne_embedding": [-2.486339807510376, -13.9393310546875]}, {"key": "devanbu2020deep", "year": "2020", "title": "Deep Learning & Software Engineering: State of Research and Future Directions", "abstract": "

Given the current transformative potential of research that sits at the intersection of Deep Learning (DL) and Software Engineering (SE), an NSF-sponsored community workshop was conducted in co-location with the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE\u201919) in San Diego, California. The goal of this workshop was to outline high priority areas for cross-cutting research. While a multitude of exciting directions for future work were identified, this report provides a general summary of the research areas representing the areas of highest priority which were discussed at the workshop. The intent of this report is to serve as a potential roadmap to guide future work that sits at the intersection of SE & DL.

\n", "tags": ["survey"], "tsne_embedding": [3.807271957397461, 22.54012107849121]}, {"key": "devlin2017semantic", "year": "2017", "title": "Semantic Code Repair using Neuro-Symbolic Transformation Networks", "abstract": "

We study the problem of semantic code repair, which can be broadly defined as automatically fixing\nnon-syntactic bugs in source code. The majority of past work in semantic code repair assumed access\nto unit tests against which candidate repairs could be validated. In contrast, the goal here is to\ndevelop a strong statistical model to accurately predict both bug locations and exact fixes without\naccess to information about the intended correct behavior of the program. Achieving such a goal\nrequires a robust contextual repair model, which we train on a large corpus of real-world source\ncode that has been augmented with synthetically injected bugs. Our framework adopts a two-stage\napproach where first a large set of repair candidates are generated by rule-based processors, and\nthen these candidates are scored by a statistical model using a novel neural network architecture\nwhich we refer to as Share, Specialize, and Compete. Specifically, the architecture (1) generates\na shared encoding of the source code using an RNN over the abstract syntax tree, \n(2) scores each candidate repair using specialized network modules, and (3) then normalizes these\nscores together so they can compete against one another in comparable probability space. We evaluate\nour model on a real-world test set gathered from GitHub containing four common categories of bugs.\nOur model is able to predict the exact correct repair 41% of the time with a single guess, compared\nto 13% accuracy for an attentional sequence-to-sequence model.

\n", "tags": ["repair"], "tsne_embedding": [20.521310806274414, -0.3797032833099365]}, {"key": "deze2021mulcode", "year": "2021", "title": "MulCode: A Multi-task Learning Approach for Source Code Understanding", "abstract": "

Recent years have witnessed the significant rise of Deep Learning (DL) techniques applied to source code. Researchers exploit DL for a multitude of tasks and achieve impressive results. However, most tasks are explored separately, resulting in a lack of generalization of the solutions. In this work, we propose MulCode, a multi-task learning approach for source code understanding that learns unified representation space for tasks, with the pre-trained BERT model for the token sequence and the Tree-LSTM model for abstract syntax trees. Furthermore, we integrate two source code views into a hybrid representation via the attention mechanism and set learnable uncertainty parameters to adjust the tasks\u2019 relationship. We train and evaluate MulCode in three downstream tasks: comment classification, author attribution, and duplicate function detection. In all tasks, MulCode outperforms the state-of-theart techniques. Moreover, experiments on three unseen tasks demonstrate the generalization ability of MulCode compared with state-of-the-art embedding methods.

\n", "tags": ["representation"], "tsne_embedding": [-7.798166275024414, -2.854518413543701]}, {"key": "deze2022bridging", "year": "2022", "title": "Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding", "abstract": "

With the great success of pre-trained models, the pretrain-then-finetune paradigm has been widely adopted on downstream tasks for source code understanding. However, compared to costly training a large-scale model from scratch, how to effectively adapt pre-trained models to a new task has not been fully explored. In this paper, we propose an approach to bridge pre-trained models and code-related tasks. We exploit semantic-preserving transformation to enrich downstream data diversity, and help pre-trained models learn semantic features that are invariant to these semantically equivalent transformations. Further, we introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models.

\n\n

We apply our approach to a range of pre-trained models, and they significantly outperform the state-of-the-art models on tasks for source code understanding, such as algorithm classification, code clone detection, and code search. Our experiments even show that without heavy pre-training on code data, natural language pre-trained model RoBERTa fine-tuned with our lightweight approach could outperform or rival existing code pre-trained models fine-tuned on the above tasks, such as CodeBERT and GraphCodeBERT. This finding suggests that there is still much room for improvement in code pre-trained models.

\n", "tags": ["representation", "language model"], "tsne_embedding": [-3.0403597354888916, -3.599219799041748]}, {"key": "dinella2020hoppity", "year": "2020", "title": "Hoppity: Learning Bug Detection and Repair", "abstract": "

We present a learning-based approach to detect and fix a broad range of bugs in Javascript programs. We frame the problem in terms of learning a sequence of graph transformations: given a buggy program modeled by a graph structure, our model makes a sequence of predictions including the position of bug nodes and corresponding graph edits to produce a fix. Unlike previous works that use deep neural networks, our approach targets bugs that are more complex and semantic in nature (i.e.~bugs that require adding or deleting statements to fix). We have realized our approach in a tool called HOPPITY. By training on 338,877 Javascript code change commits on Github, HOPPITY correctly detects and fixes bugs in 9,612 out of 42,365 programs in an end-to-end fashion. Given the bug location and type of the fix, HOPPITY also outperforms the baseline approach by a wide margin.

\n", "tags": ["edit", "repair"], "tsne_embedding": [18.813232421875, 3.879596471786499]}, {"key": "dinella2021deepmerge", "year": "2021", "title": "DeepMerge: Learning to Merge Programs", "abstract": "

Program merging is ubiquitous in modern software development. Although commonly used in most version control systems, text-based merge algorithms are prone to producing spurious merge conflicts: they report a conflict even when program changes do not interfere with each other semantically. Spurious merge conflicts are costly to development as the need for manual intervention stalls modern continuous integration pipelines. We propose a novel data-driven approach to identify and resolve spurious merge conflicts with a sequence-to-sequence machine learning model. We realize our approach in a tool DeepMerge that uses a novel combination of (i) an edit-aware embedding of merge inputs and (ii) a variation of pointer networks to construct resolutions from input segments. We also propose an algorithm to extract ground truth manual resolutions from a code corpus and employ it to curate a dataset comprising 10,729 non-trivial resolutions in Javascript programs. Our evaluation shows that DeepMerge can predict correct resolutions with high precision (72%) and modest recall (34%) on the dataset overall, and high recall (78%) on merges comprising of upto 3 lines that comprise 24% of the dataset.

\n", "tags": ["edit", "repair"], "tsne_embedding": [-1.3383288383483887, 8.521072387695312]}, {"key": "dinella2022toga", "year": "2022", "title": "TOGA: A Neural Method for Test Oracle Generation", "abstract": "

Testing is widely recognized as an important stage of the software\ndevelopment lifecycle. Effective software testing can provide benefits such as bug finding, preventing regressions, and documentation.\nIn terms of documentation, unit tests express a unit\u2019s intended\nfunctionality, as conceived by the developer. A test oracle, typically expressed as an condition, documents the intended behavior\nof a unit under a given test prefix. Synthesizing a functional test\noracle is a challenging problem, as it must capture the intended\nfunctionality rather than the implemented functionality.\nIn this paper, we propose TOGA (a neural method for Test Oracle\nGenerAtion), a unified transformer-based neural approach to infer\nboth exceptional and assertion test oracles based on the context of\nthe focal method. Our approach can handle units with ambiguous\nor missing documentation, and even units with a missing implementation. We evaluate our approach on both oracle inference accuracy\nand functional bug-finding. Our technique improves accuracy by\n33% over existing oracle inference approaches, achieving 96% overall accuracy on a held out test dataset. Furthermore, we show that\nwhen integrated with a automated test generation tool (EvoSuite),\nour approach finds 57 real world bugs in large-scale Java programs,\nincluding 30 bugs that are not found by any other automated testing\nmethod in our evaluation

\n", "tags": ["code generation", "Transformer", "test generation"], "tsne_embedding": [-16.277912139892578, 11.96065902709961]}, {"key": "ding2019asm2vec", "year": "2019", "title": "Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization", "abstract": "

Reverse engineering is a manually intensive but necessary technique for understanding the inner workings of new malware, finding vulnerabilities in existing systems, and detecting patent infringements in released software. An assembly clone search engine facilitates the work of reverse engineers by identifying those duplicated or known parts. However, it is challenging to design a robust clone search engine, since there exist various compiler optimization options and code obfuscation techniques that make logically similar assembly functions appear to be very different. A practical clone search engine relies on a robust vector representation of assembly code. However, the existing clone search approaches, which rely on a manual feature engineering process to form a feature vector for an assembly function, fail to consider the relationships between features and identify those unique patterns that can statistically distinguish assembly functions. To address this problem, we propose to jointly learn the lexical semantic relationships and the vector representation of assembly functions based on assembly code. We have developed an assembly code representation learning model \\emph{Asm2Vec}. It only needs assembly code as input and does not require any prior knowledge such as the correct mapping between assembly functions. It can find and incorporate rich semantic relationships among tokens appearing in assembly code. We conduct extensive experiments and benchmark the learning model with state-of-the-art static and dynamic clone search approaches. We show that the learned representation is more robust and significantly outperforms existing methods against changes introduced by obfuscation and optimizations.

\n", "tags": ["representation", "clone"], "tsne_embedding": [11.434524536132812, 16.076316833496094]}, {"key": "ding2021contrastive", "year": "2021", "title": "Contrastive Learning for Source Code with Structural and Functional Properties", "abstract": "

Pre-trained transformer models have recently shown promises for understanding the source code. Most existing works expect to understand code from the textual features and limited structural knowledge of code. However, the program functionalities sometimes cannot be fully revealed by the code sequence, even with structure information. Programs can contain very different tokens and structures while sharing the same functionality, but changing only one or a few code tokens can introduce unexpected or malicious program behaviors while preserving the syntax and most tokens. In this work, we present BOOST, a novel self-supervised model to focus pre-training based on the characteristics of source code. We first employ automated, structure-guided code transformation algorithms that generate (i.) functionally equivalent code that looks drastically different from the original one, and (ii.) textually and syntactically very similar code that is functionally distinct from the original. We train our model in a way that brings the functionally equivalent code closer and distinct code further through a contrastive learning objective. To encode the structure information, we introduce a new node-type masked language model objective that helps the model learn about structural context. We pre-train BOOST with a much smaller dataset than the state-of-the-art models, but our small models can still match or outperform these large models in code understanding and generation tasks.

\n", "tags": ["representation", "pretraining", "Transformer"], "tsne_embedding": [-4.703351020812988, -0.6609671115875244]}, {"key": "ding2023static", "year": "2023", "title": "A Static Evaluation of Code Completion by Large Language Models", "abstract": "

Large language models trained on code have shown great potential to increase productivity of software developers. Several execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the contrary, static analysis tools such as linters, which can detect errors without running the program, haven\u2019t been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open source repos to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors among others made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions.

\n", "tags": ["LLM", "static analysis"], "tsne_embedding": [2.330632448196411, 1.5346332788467407]}, {"key": "doderlein2022piloting", "year": "2022", "title": "Piloting Copilot and Codex: Hot Temperature, Cold Prompts, or Black Magic?", "abstract": "

Language models are promising solutions for tackling increasing complex problems. In software engineering, they recently attracted attention in code assistants, with programs automatically written in a given programming language from a programming task description in natural language. They have the potential to save time and effort when writing code. However, these systems are currently poorly understood, preventing them from being used optimally. In this paper, we investigate the various input parameters of two language models, and conduct a study to understand if variations of these input parameters (e.g. programming task description and the surrounding context, creativity of the language model, number of generated solutions) can have a significant impact on the quality of the generated programs. We design specific operators for varying input parameters and apply them over two code assistants (Copilot and Codex) and two benchmarks representing algorithmic problems (HumanEval and LeetCode). Our results showed that varying the input parameters can significantly improve the performance of language models. However, there is a tight dependency when varying the temperature, the prompt and the number of generated solutions, making potentially hard for developers to properly control the parameters to obtain an optimal result. This work opens opportunities to propose (automated) strategies for improving performance.

\n", "tags": ["Transformer"], "tsne_embedding": [6.161671161651611, 2.0264720916748047]}, {"key": "dong2023codescore", "year": "2023", "title": "CodeScore: Evaluating Code Generation by Learning Code Execution", "abstract": "

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing CEMs can be categorized into match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) and execution-based CEMs (e.g., AvgPassRatio and Pass@k), but both of them suffer from some issues. The former only measures differences in surface form regardless of the functional equivalence of codes, while the latter has huge execution overheads, including collecting expensive test cases, resolving tedious execution dependencies, and enormous execution time. To address these issues, in this paper, we propose CodeScore, an efficient and effective CEM for code generation, which estimates test case PassRatio of generated code without executing code. We also present a framework named UniCE for training unified code evaluation models by learning code execution, i.e., learning PassRatio and Executability of generated code. In order to learn code execution comprehensively, we construct more than 100 test cases for each task in several popular benchmark datasets, covering MBPP, APPS, and HumanEval. Experimental results show that CodeScore has obtained a state-of-the-art correlation with execution-based CEMs. CodeScore is strongly correlated with AvgPassPatio, and binary CodeScore is moderately correlated with Pass@1. In particular, CodeScore eliminates the need for test cases and execution dependencies in inference, and CodeScore reduces execution time by three orders of magnitude compared to AvgPassPatio and Pass@1.

\n", "tags": ["Transformer", "evaluation"], "tsne_embedding": [7.499485015869141, 0.24941056966781616]}, {"key": "drain2021deepdebug", "year": "2021", "title": "DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons", "abstract": "

The joint task of bug localization and program repair is an integral part of the software development process. In this work we present DeepDebug, an approach to automated debugging using large, pretrained transformers. We begin by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs. We apply these synthetic bugs toward two ends. First, we directly train a backtranslation model on all functions from 200K repositories. Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions in those repositories that are covered by passing tests. This provides us with rich debugging information such as stack traces and print statements, which we use to finetune our model which was pretrained on raw source code. Finally, we strengthen all our models by expanding the context window beyond the buggy function itself, and adding a skeleton consisting of that function\u2019s parent class, imports, signatures, docstrings, and method bodies, in order of priority. On the QuixBugs benchmark, we increase the total number of fixes found by over 50%, while also decreasing the false positive rate from 35% to 5% and decreasing the timeout from six hours to one minute. On our own benchmark of executable tests, our model fixes 68% of all bugs on its first attempt without using traces, and after adding traces it fixes 75% on first attempt. We will open-source our framework and validation set for evaluating on executable tests.

\n", "tags": ["repair", "Transformer"], "tsne_embedding": [21.861431121826172, 5.887087821960449]}, {"key": "drain2021generating", "year": "2021", "title": "Generating Bug-Fixes Using Pretrained Transformers", "abstract": "

Detecting and fixing bugs are two of the most important yet frustrating parts of the software development cycle. Existing bug detection tools are based mainly on static analyzers, which rely on mathematical logic and symbolic reasoning about the program execution to detect common types of bugs. Fixing bugs is typically left out to the developer. In this work we introduce DeepDebug: a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories. We frame bug-patching as a sequence-to-sequence learning task consisting of two steps: (i) denoising pretraining, and (ii) supervised finetuning on the target translation task. We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch, while domain-adaptive pretraining from natural language to code further improves the accuracy by another 32%. We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art. In contrast to prior work, we attain our best results when generating raw code, as opposed to working with abstracted code that tends to only benefit smaller capacity models. Finally, we observe a subtle improvement from adding syntax embeddings along with the standard positional embeddings, as well as with adding an auxiliary task to predict each token\u2019s syntactic class. Despite focusing on Java, our approach is language agnostic, requiring only a general-purpose parser such as tree-sitter.

\n", "tags": ["Transformer", "repair"], "tsne_embedding": [18.454635620117188, 2.0798184871673584]}, {"key": "edelmann2019neural", "year": "2019", "title": "Neural-Network Guided Expression Transformation", "abstract": "

Optimizing compilers, as well as other translator systems, often work by rewriting expressions according to equivalence preserving rules. Given an input expression and its optimized form, finding the sequence of rules that were applied is a non-trivial task. Most of the time, the tools provide no proof, of any kind, of the equivalence between the original expression and its optimized form. In this work, we propose to reconstruct proofs of equivalence of simple mathematical expressions, after the fact, by finding paths of equivalence preserving transformations between expressions. We propose to find those sequences of transformations using a search algorithm, guided by a neural network heuristic. Using a Tree-LSTM recursive neural network, we learn a distributed representation of expressions where the Manhattan distance between vectors approximately corresponds to the rewrite distance between expressions. We then show how the neural network can be efficiently used to search for transformation paths, leading to substantial gain in speed compared to an uninformed exhaustive search. In one of our experiments, our neural-network guided search algorithm is able to solve more instances with a 2 seconds timeout per instance than breadth-first search does with a 5 minutes timeout per instance.

\n", "tags": ["optimization", "grammar"], "tsne_embedding": [-9.758055686950684, 11.883269309997559]}, {"key": "ederhardt2019unsupervised", "year": "2019", "title": "Unsupervised Learning of API Aliasing Specifications", "abstract": "

Real world applications make heavy use of powerful libraries\nand frameworks, posing a significant challenge for static analysis\nas the library implementation may be very complex or unavailable.\nThus, obtaining specifications that summarize the behaviors of\nthe library is important as it enables static analyzers to precisely\ntrack the effects of APIs on the client program, without requiring\nthe actual API implementation.

\n\n

In this work, we propose a novel method\nfor discovering aliasing specifications of APIs by learning from a large\ndataset of programs. Unlike prior work, our method does not require\nmanual annotation, access to the library\u2019s source code or ability to\nrun its APIs. Instead, it learns specifications in a fully unsupervised manner,\nby statically observing usages of APIs in the dataset. The core idea is to\nlearn a probabilistic model of interactions between API methods and aliasing\nobjects, enabling identification of additional likely aliasing relations,\nand to then infer aliasing specifications ofAPIs that explain these relations.\nThe learned specifications are then used to augment an API-aware points-to analysis.

\n\n

We implemented our approach in a tool called USpec and used it to automatically\nlearn aliasing specifications from millions of source code files.\nUSpec learned over 2000 specifications of various Java and Python APIs, in the process\nimproving the results of the points-to analysis and its clients.

\n", "tags": ["API", "program analysis"], "tsne_embedding": [9.124704360961914, -17.2502498626709]}, {"key": "efstathiou2019semantic", "year": "2019", "title": "Semantic Source Code Models Using Identifier Embeddings", "abstract": "

The emergence of online open source repositories in the recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As an effect, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code which can assist tasks of code search and reuse. To this end, we deliver in the form of pretrained vector space models, distributed code representations for six popular programming languages, namely, Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state-of-the-art library for learning word representations. Each model is trained on data from a single programming language; the code mined for producing all models amounts to over 13.000 repositories. We indicate dissimilarities between natural language and source code, as well as variations in coding conventions in between the different programming languages we processed. We describe how these heterogeneities guided the data preprocessing decisions we took and the selection of the training parameters in the released models. Finally, we propose potential applications of the models and discuss limitations of the models.

\n", "tags": ["representation"], "tsne_embedding": [5.13707971572876, -13.44411563873291]}, {"key": "eghbali2022crystalbleu", "year": "2022", "title": "CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code", "abstract": "

Recent years have brought a surge of work on predicting pieces\nof source code, e.g., for code completion, code migration, program\nrepair, or translating natural language into code. All this work faces\nthe challenge of evaluating the quality of a prediction w.r.t. some\noracle, typically in the form of a reference solution. A common\nevaluation metric is the BLEU score, an n-gram-based metric originally proposed for evaluating natural language translation, but\nadopted in software engineering because it can be easily computed\non any programming language and enables automated evaluation at\nscale. However, a key difference between natural and programming\nlanguages is that in the latter, completely unrelated pieces of code\nmay have many common n-grams simply because of the syntactic\nverbosity and coding conventions of programming languages. We\nobserve that these trivially shared n-grams hamper the ability of\nthe metric to distinguish between truly similar code examples and\ncode examples that are merely written in the same language. This\npaper presents CrystalBLEU, an evaluation metric based on BLEU,\nthat allows for precisely and efficiently measuring the similarity of\ncode. Our metric preserves the desirable properties of BLEU, such\nas being language-agnostic, able to handle incomplete or partially\nincorrect code, and efficient, while reducing the noise caused by\ntrivially shared n-grams. We evaluate CrystalBLEU on two datasets\nfrom prior work and on a new, labeled dataset of semantically equivalent programs. Our results show that CrystalBLEU can distinguish\nsimilar from dissimilar code examples 1.9\u20134.5 times more effectively, when compared to the original BLEU score and a previously\nproposed variant of BLEU for code.

\n", "tags": ["evaluation"], "tsne_embedding": [6.7913713455200195, -10.996931076049805]}, {"key": "ellis2021dreamcoder", "year": "2021", "title": "DreamCoder: bootstrapping inductive program synthesis with wake-sleep library learning", "abstract": "

We present a system for inductive program synthesis called DreamCoder, which inputs a corpus of synthesis problems each specified by one or a few examples, and automatically derives a library of program components and a neural search policy that can be used to efficiently solve other similar synthesis problems. The library and search policy bootstrap each other iteratively through a variant of \u201cwake-sleep\u201d approximate Bayesian learning. A new refactoring algorithm based on E-graph matching identifies common sub-components across synthesized programs, building a progressively deepening library of abstractions capturing the structure of the input domain. We evaluate on eight domains including classic program synthesis areas and AI tasks such as planning, inverse graphics, and equation discovery. We show that jointly learning the library and neural search policy leads to solving more problems, and solving them more quickly.

\n", "tags": ["synthesis", "search"], "tsne_embedding": [6.257015228271484, 7.016359806060791]}, {"key": "elnaggar2021codetrans", "year": "2021", "title": "CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing", "abstract": "

Currently, a growing number of mature natural language processing applications make people\u2019s life more convenient. Such applications are built by source code - the language in software engineering. However, the applications for understanding source code language to ease the software engineering process are under-researched. Simultaneously, the transformer model, especially its combination with transfer learning, has been proven to be a powerful technique for natural language processing tasks. These breakthroughs point out a promising direction for process source code and crack software engineering tasks. This paper describes CodeTrans - an encoder-decoder transformer model for tasks in the software engineering domain, that explores the effectiveness of encoder-decoder transformer models for six software engineering tasks, including thirteen sub-tasks. Moreover, we have investigated the effect of different training strategies, including single-task learning, transfer learning, multi-task learning, and multi-task learning with fine-tuning. CodeTrans outperforms the state-of-the-art models on all the tasks. To expedite future works in the software engineering domain, we have published our pre-trained models of CodeTrans.

\n", "tags": ["Transformer"], "tsne_embedding": [-6.095447063446045, -2.6534104347229004]}, {"key": "eniser2023automatically", "year": "2023", "title": "Automatically Testing Functional Properties of Code Translation Models", "abstract": "

Large language models are becoming increasingly practical for translating code across programming languages, a process known as $transpiling$. Even though automated transpilation significantly boosts developer productivity, a key concern is whether the generated code is correct. Existing work initially used manually crafted test suites to test the translations of a small corpus of programs; these test suites were later automated. In contrast, we devise the first approach for automated, functional, property-based testing of code translation models. Our general, user-provided specifications about the transpiled code capture a range of properties, from purely syntactic to purely semantic ones. As shown by our experiments, this approach is very effective in detecting property violations in popular code translation models, and therefore, in evaluating model quality with respect to given properties. We also go a step further and explore the usage scenario where a user simply aims to obtain a correct translation of some code with respect to certain properties without necessarily being concerned about the overall quality of the model. To this purpose, we develop the first property-guided search procedure for code translation models, where a model is repeatedly queried with slightly different parameters to produce alternative and potentially more correct translations. Our results show that this search procedure helps to obtain significantly better code translations.

\n", "tags": ["translation"], "tsne_embedding": [1.5963600873947144, -20.07757568359375]}, {"key": "feng2020codebert", "year": "2020", "title": "CodeBERT: A Pre-Trained Model for Programming and Natural Languages", "abstract": "

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and nat-ural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language codesearch, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.

\n", "tags": ["pretraining"], "tsne_embedding": [-4.743239402770996, -5.3774213790893555]}, {"key": "fernandes2019structured", "year": "2019", "title": "Structured Neural Summarization", "abstract": "

Summarization of long sequences into a concise statement is a core problem in natural language processing, requiring non-trivial understanding of the input. Based on the promising results of graph neural networks on highly structured data, we develop a framework to extend existing sequence encoders with a graph component that can reason about long-distance relationships in weakly structured data such as text. In an extensive evaluation, we show that the resulting hybrid sequence-graph models outperform both pure sequence models as well as pure graph models on a range of summarization tasks.

\n", "tags": ["summarization", "GNN", "documentation"], "tsne_embedding": [-18.65509796142578, -6.495089054107666]}, {"key": "fowkes2016parameter", "year": "2016", "title": "Parameter-Free Probabilistic API Mining across GitHub", "abstract": "

Existing API mining algorithms can be difficult to use as they require expensive parameter tuning and the returned set of API calls can be large, highly redundant and difficult to understand. To address this, we present PAM (Probabilistic API Miner), a near parameter-free probabilistic algorithm for mining the most interesting API call patterns. We show that PAM significantly outperforms both MAPO and UPMiner, achieving 69% test-set precision, at retrieving relevant API call sequences from GitHub. Moreover, we focus on libraries for which the developers have explicitly provided code examples, yielding over 300,000 LOC of hand-written API example code from the 967 client projects in the data set. This evaluation suggests that the hand-written examples actually have limited coverage of real API usages.

\n\n", "tags": ["API", "pattern mining"], "tsne_embedding": [8.925658226013184, -18.729835510253906]}, {"key": "fowkes2017autofolding", "year": "2017", "title": "Autofolding for Source Code Summarization", "abstract": "

Developers spend much of their time reading and browsing source code, raising new opportunities for summarization methods. Indeed, modern code editors provide code folding, which allows one to selectively hide blocks of code. However this is impractical to use as folding decisions must be made manually or based on simple rules. We introduce the\nautofolding problem, which is to automatically create a code summary by folding less informative code regions. We present a novel solution by formulating the problem as a sequence of AST folding decisions, leveraging a scoped topic model for code tokens. On an annotated set of popular open source projects, we show that our summarizer outperforms simpler baselines, yielding a 28% error reduction. Furthermore, we find through a case study that our summarizer is strongly preferred by experienced developers. More broadly, we hope this work will aid program comprehension by turning code folding into a usable and valuable tool.

\n", "tags": ["summarization"], "tsne_embedding": [-16.36794090270996, -9.737833976745605]}, {"key": "franks2015cacheca", "year": "2015", "title": "CACHECA: A Cache Language Model Based Code Suggestion Tool", "abstract": "

Nearly every Integrated Development Environment includes a form of code completion. The suggested completions (\u201csuggestions\u201d) are typically based on information available at compile time, such as type signatures and variables in scope. A statistical approach, based on estimated models of code patterns in large code corpora, has been demonstrated to be effective at predicting tokens given a context. In this demo, we present CACHECA, an Eclipse plugin that combines the native suggestions with a statistical suggestion regime. We demonstrate that a combination of the two approaches more than doubles Eclipse\u2019s suggestion accuracy. A video demonstration is available at https://www.youtube.com/watch?v=3INk0N3JNtc.

\n", "tags": ["language model"], "tsne_embedding": [-10.773092269897461, -16.59006690979004]}, {"key": "fried2022incoder", "year": "2022", "title": "InCoder: A Generative Model for Code Infilling and Synthesis", "abstract": "

Code is seldom written in a single left-to-right pass and is instead repeatedly edited and refined. We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) as well as editing (via infilling). InCoder is trained to generate code files from a large corpus of permissively licensed code, where regions of code have been randomly masked and moved to the end of each file, allowing code infilling with bidirectional context. Our model is the first generative model that is able to directly perform zero-shot code infilling, which we evaluate on challenging tasks such as type inference, comment generation, and variable re-naming. We find that the ability to condition on bidirectional context substantially improves performance on these tasks, while still performing comparably on standard program synthesis benchmarks in comparison to left-to-right only models pretrained at similar scale. The InCoder models and code are publicly released at https://sites.google.com/view/incoder-code-models

\n", "tags": ["Transformer", "code generation", "naming", "summarization"], "tsne_embedding": [3.6642134189605713, 4.217130661010742]}, {"key": "fu2019coda", "year": "2019", "title": "Coda: An End-to-End Neural Program Decompiler", "abstract": "

Reverse engineering of binary executables is a critical problem in the computer security domain. On the one hand, malicious parties may recover interpretable source codes from the software products to gain commercial advantages. On the other hand, binary decompilation can be leveraged for code vulnerability analysis and malware detection. However, efficient binary decompilation is challenging. Conventional decompilers have the following major limitations: (i) they are only applicable to specific source-target language pair, hence incurs undesired development cost for new language tasks; (ii) their output high-level code cannot effectively preserve the correct functionality of the input binary; (iii) their output program does not capture the semantics of the input and the reversed program is hard to interpret. To address the above problems, we propose Coda1, the first end-to-end neural-based framework for code decompilation. Coda decomposes the decompilation task into of two key phases: First, Coda employs an instruction type-aware encoder and a tree decoder for generating an abstract syntax tree (AST) with attention feeding during the code sketch generation stage. Second, Coda then updates the code sketch using an iterative error correction machine guided by an ensembled neural error predictor. By finding a good approximate candidate and then fixing it towards perfect, Coda achieves superior with performance compared to baseline approaches. We assess Coda\u2019s performance with extensive experiments on various benchmarks. Evaluation results show that Coda achieves an average of 82% program recovery accuracy on unseen binary samples, where the state-of-the-art decompilers yield 0% accuracy. Furthermore, Coda outperforms the sequence-to-sequence model with attention by a margin of 70% program accuracy. Our work reveals the vulnerability of binary executables and imposes a new threat to the protection of Intellectual Property (IP) for software development.

\n", "tags": ["decompilation"], "tsne_embedding": [12.509137153625488, 15.252440452575684]}, {"key": "gao2019neural", "year": "2019", "title": "A Neural Model for Method Name Generation from Functional Description", "abstract": "

The names of software artifacts, e.g., method names, are important for software understanding and maintenance, as good names can help developers easily understand others\u2019 code. However, the existing naming guidelines are difficult for developers, especially novices, to come up with meaningful, concise and compact names for the variables, methods, classes and files. With the popularity of open source, an enormous amount of project source code can be accessed, and the exhaustiveness and instability of manually naming methods could now be relieved by automatically learning a naming model from a large code repository. Nevertheless, building a comprehensive naming system is still challenging, due to the gap between natural language functional descriptions and method names. Specifically, there are three challenges: how to model the relationship between the functional descriptions and formal method names, how to handle the explosion of vocabulary when dealing with large repositories, and how to leverage the knowledge learned from large repositories to a specific project. To answer these questions, we propose a neural network to directly generate readable method names from natural language description. The proposed method is built upon the encoder-decoder framework with the attention and copying mechanisms. Our experiments show that our method can generate meaningful and accurate method names and achieve significant improvement over the state-of-the-art baseline models. We also address the cold-start problem using a training trick to utilize big data in GitHub for specific projects.

\n", "tags": ["naming", "summarization"], "tsne_embedding": [9.557541847229004, -7.01735782623291]}, {"key": "garg2022deepperf", "year": "2022", "title": "DeepPERF: A Deep Learning-Based Approach For Improving Software Performance", "abstract": "

Improving software performance is an important yet challenging part of the software development cycle. Today, the majority of performance inefficiencies are identified and patched by performance experts. Recent advancements in deep learning approaches and the wide-spread availability of open source data creates a great opportunity to automate the identification and patching of performance problems. In this paper, we present DeepPERF, a transformer-based approach to suggest performance improvements for C# applications. We pretrain DeepPERF on English and Source code corpora and followed by finetuning for the task of generating performance improvement patches for C# applications. Our evaluation shows that our model can generate the same performance improvement suggestion as the developer fix in ~53% of the cases, getting ~34% of them verbatim in our expert-verified dataset of performance changes made by C# developers. Additionally, we evaluate DeepPERF on 50 open source C# repositories on GitHub using both benchmark and unit tests and find that our model is able to suggest valid performance improvements that can improve both CPU usage and Memory allocations. So far we\u2019ve submitted 19 pull-requests with 28 different performance optimizations and 11 of these PRs have been approved by the project owners.

\n", "tags": ["Transformer", "optimization"], "tsne_embedding": [5.300485610961914, 20.98826026916504]}, {"key": "gharibi2024t5apr", "year": "2024", "title": "T5APR: Empowering Automated Program Repair across Languages through Checkpoint Ensemble", "abstract": "

Automated program repair (APR) using deep learning techniques has become an important area of research in recent years, aiming to automatically generate bug-fixing patches that can improve software reliability and maintainability. However, most existing methods either target a single language or require high computational resources to train multilingual models. In this paper, we propose T5APR, a novel neural program repair approach that provides a unified solution for bug fixing across multiple programming languages. T5APR leverages CodeT5, a powerful pre-trained text-to-text transformer model, and adopts a checkpoint ensemble strategy to improve patch recommendation. We conduct comprehensive evaluations on six well-known benchmarks in four programming languages (Java, Python, C, JavaScript), demonstrating T5APR\u2019s competitiveness against state-of-the-art techniques. T5APR correctly fixes 1,985 bugs, including 1,442 bugs that none of the compared techniques has fixed. We further support the effectiveness of our approach by conducting detailed analyses, such as comparing the correct patch ranking among different techniques. The findings of this study demonstrate the potential of T5APR for use in real-world applications and highlight the importance of multilingual approaches in the field of APR.

\n", "tags": ["repair", "Transformer"], "tsne_embedding": [20.28875732421875, 0.7284896969795227]}, {"key": "gholamian2021naturalness", "year": "2021", "title": "On the Naturalness and Localness of Software Logs", "abstract": "

Logs are an essential part of the development and\nmaintenance of large and complex software systems as they\ncontain rich information pertaining to the dynamic content and\nstate of the system. As such, developers and practitioners rely\nheavily on the logs to monitor their systems. In parallel, the\nincreasing volume and scale of the logs, due to the growing\ncomplexity of modern software systems, renders the traditional\nway of manual log inspection insurmountable. Consequently, to\nhandle large volumes of logs efficiently and effectively, various\nprior research aims to automate the analysis of log files. Thus, in\nthis paper, we begin with the hypothesis that log files are natural\nand local and these attributes can be applied for automating log\nanalysis tasks. We guide our research with six research questions\nwith regards to the naturalness and localness of the log files, and\npresent a case study on anomaly detection and introduce a tool\nfor anomaly detection, called ANALOG, to demonstrate how our\nnew findings facilitate the automated analysis of logs.

\n", "tags": ["logging", "language model"], "tsne_embedding": [23.353139877319336, 9.029112815856934]}, {"key": "glassman2015overcode", "year": "2015", "title": "OverCode: visualizing variation in student solutions to programming problems at scale", "abstract": "

In MOOCs, a single programming exercise may produce thousands of solutions from learners. Understanding solution variation is important for providing appropriate feedback to students at scale. The wide variation among these solutions can be a source of pedagogically valuable examples and can be used to refine the autograder for the exercise by exposing corner cases. We present OverCode, a system for visualizing and exploring thousands of programming solutions. OverCode uses both static and dynamic analysis to cluster similar solutions, and lets teachers further filter and cluster solutions based on different criteria. We evaluated OverCode against a nonclustering baseline in a within-subjects study with 24 teaching assistants and found that the OverCode interface allows teachers to more quickly develop a high-level view of students\u2019 understanding and misconceptions, and to provide feedback that is relevant to more students\u2019 solutions.

\n", "tags": ["repair"], "tsne_embedding": [-14.808202743530273, 18.565975189208984]}, {"key": "goens2019case", "year": "2019", "title": "A case study on machine learning for synthesizing benchmarks", "abstract": "

Good benchmarks are hard to find because they require a substantial effort to keep them representative for the constantly changing challenges of a particular field. Synthetic benchmarks are a common approach to deal with this, and methods from machine learning are natural candidates for synthetic benchmark generation. In this paper we investigate the usefulness of machine learning in the prominent CLgen benchmark generator. We re-evaluate CLgen by comparing the benchmarks generated by the model with the raw data used to train it. This re-evaluation indicates that, for the use case considered, machine learning did not yield additional benefit over a simpler method using the raw data. We investigate the reasons for this and provide further insights into the challenges the problem could pose for potential future generators.

\n", "tags": ["code generation"], "tsne_embedding": [6.146706581115723, 9.211435317993164]}, {"key": "gros2020code", "year": "2020", "title": "Code to Comment \"Translation\": Data, Metrics, Baselining & Evaluation", "abstract": "

The relationship of comments to code, and in particular, the task of generating useful comments given the code, has long been of interest. The earliest approaches have been based on strong syntactic theories of comment-structures, and relied on textual templates. More recently, researchers have applied deep learning methods to this task, and specifically, trainable generative translation models which are known to work very well for Natural Language translation (e.g., from German to English). We carefully examine the underlying assumption here: that the task of generating comments sufficiently resembles the task of translating between natural languages, and so similar models and evaluation metrics could be used. We analyze several recent code-comment datasets for this task: CodeNN, DeepCom, FunCom, and DocString. We compare them with WMT19, a standard dataset frequently used to train state of the art natural language translators. We found some interesting differences between the code-comment data and the WMT19 natural language data. Next, we describe and conduct some studies to calibrate BLEU (which is commonly used as a measure of comment quality). using \u201caffinity pairs\u201d of methods, from different projects, in the same project, in the same class, etc; Our study suggests that the current performance on some datasets might need to be improved substantially. We also argue that fairly naive information retrieval (IR) methods do well enough at this task to be considered a reasonable baseline. Finally, we make some suggestions on how our findings might be used in future research in this area.

\n", "tags": ["bimodal", "documentation"], "tsne_embedding": [-9.670178413391113, -3.4335203170776367]}, {"key": "gu2016deep", "year": "2016", "title": "Deep API Learning", "abstract": "

Developers often wonder how to implement a certain functionality (e.g., how to parse XML files) using APIs. Obtaining an API usage sequence based on an API-related natural language query is very helpful in this regard. Given a query, existing approaches utilize information retrieval models to search for matching API sequences. These approaches treat queries and APIs as bag-of-words (i.e., keyword matching or word-to-word alignment) and lack a deep understanding of the semantics of the query.

\n\n

We propose DeepAPI, a deep learning based approach to generate API usage sequences for a given natural language query. Instead of a bags-of-words assumption, it learns the\nsequence of words in a query and the sequence of associated APIs. DeepAPI adapts a neural language model named RNN Encoder-Decoder. It encodes a word sequence (user query) into a fixed-length context vector, and generates an API sequence based on the context vector. We also augment the RNN Encoder-Decoder by considering the importance of individual APIs. We empirically evaluate our approach with more than 7 million annotated code snippets collected from GitHub. The results show that our approach generates largely accurate API sequences and outperforms the related approaches.

\n\n", "tags": ["API", "search"], "tsne_embedding": [0.6723843812942505, -16.41582489013672]}, {"key": "gu2017deepam", "year": "2017", "title": "DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning", "abstract": "

Computer programs written in one language are often required to be ported to other languages to support multiple devices and environments. When programs use language specific APIs (Application Programming Interfaces), it is very challenging to migrate these APIs to the corresponding APIs written in other languages. Existing approaches mine API mappings from projects that have corresponding versions in two languages. They rely on the sparse availability of bilingual projects, thus producing a limited number of API mappings. In this paper, we propose an intelligent system called DeepAM for automatically mining API mappings from a large-scale code corpus without bilingual projects. The key component of DeepAM is based on the multimodal sequence to sequence learning architecture that aims to learn joint semantic representations of bilingual API sequences from big source code data. Experimental results indicate that DeepAM significantly increases the accuracy of API mappings as well as the number of API mappings, when compared with the state-of-the-art approaches.

\n", "tags": ["API"], "tsne_embedding": [4.475818634033203, -18.018661499023438]}, {"key": "gu2018deep", "year": "2018", "title": "Deep Code Search", "abstract": "

To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language query. They lack a deep understanding of the semantics of queries and source code.

\n\n

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized and irrelevant/noisy keywords in queries can be handled.

\n\n

As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.

\n\n", "tags": ["search"], "tsne_embedding": [-1.4421424865722656, -14.51733684539795]}, {"key": "gui2022cross", "year": "2022", "title": "Cross-Language Binary-Source Code Matching with Intermediate Representations", "abstract": "

Binary-source code matching plays an important role in many security and software engineering related tasks such as malware detection, reverse engineering and vulnerability assessment. Currently, several approaches have been proposed for binary-source code matching by jointly learning the embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches target on matching the binary code and source code written in a single programming language. However, in practice, software applications are often written in different programming languages to cater for different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching, and develops a new dataset for this new problem. We present a novel approach XLIR, which is a Transformer-based neural network by learning the intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks of cross-language binary-source code matching, and cross-language source-source code matching, on top of our curated dataset. Experimental results and analysis show that our proposed XLIR with intermediate representations significantly outperforms other state-of-the-art models in both of the two tasks.

\n", "tags": ["code similarity", "clone"], "tsne_embedding": [9.517804145812988, 16.698246002197266]}, {"key": "gulwani2014nlyze", "year": "2014", "title": "NLyze: Interactive Programming by Natural Language for SpreadSheet Data Analysis and Manipulation", "abstract": "

Millions of computer end users need to perform tasks over tabular spreadsheet data, yet lack the programming knowledge to do such tasks automatically. This paper describes\nthe design and implementation of a robust natural language\nbased interface to spreadsheet programming. Our methodology involves designing a typed domain-specific language\n(DSL) that supports an expressive algebra of map, filter, reduce, join, and formatting capabilities at a level of abstraction appropriate for non-expert users. The key algorithmic\ncomponent of our methodology is a translation algorithm\nfor converting a natural language specification in the context of a given spreadsheet to a ranked set of likely programs\nin the DSL. The translation algorithm leverages the spreadsheet spatial and temporal context to assign interpretations\nto specifications with implicit references, and is thus robust\nto a variety of ways in which end users can express the same\ntask. The translation algorithm builds over ideas from keyword programming and semantic parsing to achieve both\nhigh precision and high recall. We implemented the system\nas an Excel add-in called NLyze that supports a rich user\ninteraction model including annotating the user\u2019s natural\nlanguage specification and explaining the synthesized DSL\nprograms by paraphrasing them into structured English. We\ncollected a total of 3570 English descriptions for 40 spreadsheet tasks and our system was able to generate the intended\ninterpretation as the top candidate for 94% (97% for the top\n3) of those instances.

\n\n", "tags": ["code generation", "bimodal", "synthesis"], "tsne_embedding": [9.582815170288086, -1.6722452640533447]}, {"key": "guo2017semantically", "year": "2017", "title": "Semantically enhanced software traceability using deep learning techniques", "abstract": "

In most safety-critical domains the need for traceability is prescribed by certifying bodies. Trace links are generally created among requirements, design, source code, test cases and other artifacts; however, creating such links manually is time consuming and error prone. Automated solutions use information retrieval and machine learning techniques to generate trace links; however, current techniques fail to understand semantics of the software artifacts or to integrate domain knowledge into the tracing process and therefore tend to deliver imprecise and inaccurate results. In this paper, we present a solution that uses deep learning to incorporate requirements artifact semantics and domain knowledge into the tracing solution. We propose a tracing network architecture that utilizes Word Embedding and Recurrent Neural Network (RNN) models to generate trace links. Word embedding learns word vectors that represent knowledge of the domain corpus and RNN uses these word vectors to learn the sentence semantics of requirements artifacts. We trained 360 different configurations of the tracing network using existing trace links in the Positive Train Control domain and identified the Bidirectional Gated Recurrent Unit (BI-GRU) as the best model for the tracing task. BI-GRU significantly out-performed state-of-the-art tracing methods including the Vector Space Model and Latent Semantic Indexing.

\n", "tags": ["traceability", "representation"], "tsne_embedding": [-3.1759586334228516, 6.9184064865112305]}, {"key": "guo2020graphcodebert", "year": "2020", "title": "GraphCodeBERT: Pre-training Code Representations with Data Flow", "abstract": "

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of \u201cwhere-the-value-comes-from\u201d between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.

\n", "tags": ["pretraining"], "tsne_embedding": [-4.0795745849609375, -2.63449764251709]}, {"key": "guo2022learning", "year": "2022", "title": "Learning to Complete Code with Sketches", "abstract": "

Code completion is usually cast as a language modelling problem, i.e., continuing an input in a left-to-right fashion. However, in practice, some parts of the completion (e.g., string literals) may be very hard to predict, whereas subsequent parts directly follow from the context. To handle this, we instead consider the scenario of generating code completions with \u201choles\u201d inserted in places where a model is uncertain. We develop Grammformer, a Transformer-based model that guides code generation by the programming language grammar, and compare it to a variety of more standard sequence models.

\n\n

We train the models on code completion for C# and Python given partial code context. To evaluate models, we consider both ROUGE as well as a new metric RegexAcc that measures success of generating completions matching long outputs with as few holes as possible. In our experiments, Grammformer generates 10-50% more accurate completions compared to traditional generative models and 37-50% longer sketches compared to sketch-generating baselines trained with similar techniques.

\n", "tags": ["Transformer", "language model", "grammar"], "tsne_embedding": [-11.331404685974121, -13.995841026306152]}, {"key": "guo2022unixcoder", "year": "2022", "title": "UniXcoder: Unified Cross-Modal Pre-training for Code Representation", "abstract": "

Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion that requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming language. The model utilizes mask attention matrices with prefix adapters to control the behavior of the model and leverages cross-modal contents like AST and code comment to enhance code representation. To encode AST that is represented as a tree in parallel, we propose a one-to-one mapping method to transform AST in a sequence structure that retains all structural information from the tree. Furthermore, we propose to utilize multi-modal contents to learn representation of code fragment with contrastive learning, and then align representations among programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the performance of code fragment representation, we also construct a dataset for a new task, called zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks and analysis reveals that comment and AST can both enhance UniXcoder.

\n", "tags": ["Transformer"], "tsne_embedding": [-3.3583414554595947, -1.2147923707962036]}, {"key": "guo2024deepseek", "year": "2024", "title": "DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence", "abstract": "

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.

\n", "tags": ["Transformers"], "tsne_embedding": [-1.0626811981201172, 3.4755656719207764]}, {"key": "gupta2017deepfix", "year": "2017", "title": "DeepFix: Fixing Common C Language Errors by Deep Learning", "abstract": "

The problem of automatically fixing programming errors is a\nvery active research topic in software engineering. This is a\nchallenging problem as fixing even a single error may require\nanalysis of the entire program. In practice, a number of errors\narise due to programmer\u2019s inexperience with the programming language or lack of attention to detail. We call these\ncommon programming errors. These are analogous to grammatical errors in natural languages. Compilers detect such errors, but their error messages are usually inaccurate. In this\nwork, we present an end-to-end solution, called DeepFix, that\ncan fix multiple such errors in a program without relying on\nany external tool to locate or fix them. At the heart of DeepFix\nis a multi-layered sequence-to-sequence neural network with\nattention which is trained to predict erroneous program locations along with the required correct statements. On a set of\n6971 erroneous C programs written by students for 93 programming tasks, DeepFix could fix 1881 (27%) programs\ncompletely and 1338 (19%) programs partially.

\n", "tags": ["repair", "code generation"], "tsne_embedding": [21.71475601196289, -3.2107796669006348]}, {"key": "gupta2018deep", "year": "2018", "title": "Deep Reinforcement Learning for Programming Language Correction", "abstract": "

Novice programmers often struggle with the formal\nsyntax of programming languages. To assist them,\nwe design a novel programming language correction framework amenable to reinforcement learning. The framework allows an agent to mimic human actions for text navigation and editing. We\ndemonstrate that the agent can be trained through\nself-exploration directly from the raw input, that is,\nprogram text itself, without any knowledge of the\nformal syntax of the programming language. We\nleverage expert demonstrations for one tenth of the\ntraining data to accelerate training. The proposed\ntechnique is evaluated on 6975\nerroneous C programs with typographic errors, written by students\nduring an introductory programming course. Our\ntechnique fixes 14%\nmore programs and 29% more\ncompiler error messages relative to those fixed by\na state-of-the-art tool, DeepFix, which uses a fully\nsupervised neural machine translation approach.

\n", "tags": ["repair", "code generation"], "tsne_embedding": [21.85055923461914, -3.9716861248016357]}, {"key": "gupta2018intelligent", "year": "2018", "title": "Intelligent code reviews using deep learning", "abstract": "

Peer code review is a best practice in Software Engineering where source code is reviewed manually by one or more peers(reviewers) of the code author. It is widely acceptable both in industry and open-source software (OSS) systems as a process for early detection and reduction of software defects. A larger chunk of reviews given during peer reviews are related to common issues such as coding style, documentations, and best practices. This makes the code review process less effective as reviewers focus less on finding important defects. Hence, there is a need to automatically find such common issues and help reviewers perform focused code reviews. Some of this is solved by rule based systems called linters but they are rigid and needs a lot of manual effort to adapt them for a new issue.

\n\n

In this work, we present an automatic, flexible, and adaptive code analysis system called DeepCodeReviewer (DCR). DCR learns how to recommend code reviews related to common issues using historical peer reviews and deep learning. DCR uses deep learning to learn review relevance to a code snippet and recommend the right review from a repository of common reviews. DCR is trained on histroical peer reviews available from internal code repositories at Microsoft. Experiments demonstrate strong performance of developed deep learning model in classifying relevant and non-relevant reviews w.r.t to a code snippet, and ranking reviews given a code snippet. We have also evaluated DCR recommentations using a user study and survey. The results of our user study show good acceptance rate and answers of our survey questions are strongly correlated with our system\u2019s goal of making code reviews focused on finding defects.

\n", "tags": ["representation", "review"], "tsne_embedding": [-8.297481536865234, 2.3598275184631348]}, {"key": "gupta2019neural", "year": "2019", "title": "Neural Attribution for Semantic Bug-Localization in Student Programs", "abstract": "

Providing feedback is an integral part of teaching. Most open online courses on programming make use of automated grading systems to support programming assignments and give real-time feedback. These systems usually rely on test results to quantify the programs\u2019 functional correctness. They return failing tests to the students as feedback. However, students may find it difficult to debug their programs if they receive no hints about where the bug is and how to fix it. In this work, we present NeuralBugLocator, a deep learning based technique, that can localize the bugs in a faulty program with respect to a failing test, without even running the program. At the heart of our technique is a novel tree convolutional neural network which is trained to predict whether a program passes or fails a given test. To localize the bugs, we analyze the trained network using a state-of-the-art neural prediction attribution technique and see which lines of the programs make it predict the test outcomes. Our experiments show that NeuralBugLocator is generally more accurate than two state-of-the-art program-spectrum based and one syntactic difference based bug-localization baselines.

\n", "tags": ["defect", "representation"], "tsne_embedding": [15.704065322875977, 4.30213737487793]}, {"key": "gupta2023grace", "year": "2023", "title": "Grace: Language Models Meet Code Edits", "abstract": "

Developers spend a significant amount of time in editing code for a variety of reasons such as bug fixing or adding new features. Designing effective methods to predict code edits has been an active yet challenging area of research due to the diversity of code edits and the difficulty of capturing the developer intent. In this work, we address these challenges by endowing pre-trained large language models (LLMs) with the knowledge of relevant prior associated edits, which we call the Grace (Generation conditioned on Associated Code Edits) method. The generative capability of the LLMs helps address the diversity in code changes and conditioning code generation on prior edits helps capture the latent developer intent. We evaluate two well-known LLMs, codex and CodeT5, in zero-shot and fine-tuning settings respectively. In our experiments with two datasets, Grace boosts the performance of the LLMs significantly, enabling them to generate 29% and 54% more correctly edited code in top-1 suggestions relative to the current state-of-the-art symbolic and neural approaches, respectively.

\n", "tags": ["editing"], "tsne_embedding": [-1.866490364074707, 1.229974627494812]}, {"key": "gvero2015synthesizing", "year": "2015", "title": "Synthesizing Java expressions from free-form queries", "abstract": "

We present a new code assistance tool for integrated development environments. Our system accepts as input free-form queries containing a mixture of English and Java, and produces Java code expressions that take the query into account and respect syntax, types, and scoping rules of Java, as well as statistical usage patterns. In contrast to solutions based on code search, the results returned by our tool need not directly correspond to any previously seen code fragment. As part of our system we have constructed a probabilistic context free grammar for Java constructs and library invocations, as well as an algorithm that uses a customized natural language processing tool chain to extract information from free-form text queries. We present the results on a number of examples showing that our technique (1) often produces the expected code fragments, (2) tolerates much of the flexibility of natural language, and (3) can repair incorrect Java expressions that use, for example, the wrong syntax or missing arguments.

\n", "tags": ["synthesis", "code generation", "bimodal"], "tsne_embedding": [-13.727494239807129, -19.741004943847656]}, {"key": "habib2019neural", "year": "2019", "title": "Neural Bug Finding: A Study of Opportunities and Challenges", "abstract": "

Static analysis is one of the most widely adopted techniques to find software bugs before code is put in production. Designing and implementing effective and efficient static analyses is difficult and requires high expertise, which results in only a few experts able to write such analyses. This paper explores the opportunities and challenges of an alternative way of creating static bug detectors: neural bug finding. The basic idea is to formulate bug detection as a classification problem, and to address this problem with neural networks trained on examples of buggy and non-buggy code. We systematically study the effectiveness of this approach based on code examples labeled by a state-of-the-art, static bug detector. Our results show that neural bug finding is surprisingly effective for some bug patterns, sometimes reaching a precision and recall of over 80%, but also that it struggles to understand some program properties obvious to a traditional analysis. A qualitative analysis of the results provides insights into why neural bug finders sometimes work and sometimes do not work. We also identify pitfalls in selecting the code examples used to train and validate neural bug finders, and propose an algorithm for selecting effective training data.

\n", "tags": ["program analysis"], "tsne_embedding": [16.626909255981445, 5.1609086990356445]}, {"key": "hajipour2019samplefix", "year": "2019", "title": "SampleFix: Learning to Correct Programs by Sampling Diverse Fixes", "abstract": "

Automatic program correction is an active topic of research, which holds the potential of dramatically improving productivity of programmers during the software development process and correctness of software in general. Recent advances in machine learning, deep learning and NLP have rekindled the hope to eventually fully automate the process of repairing programs. A key challenges is ambiguity, as multiple codes \u2013 or fixes \u2013 can implement the same functionality. In addition, dataset by nature fail to capture the variance introduced by such ambiguities. Therefore, we propose a deep generative model to automatically correct programming errors by learning a distribution of potential fixes. Our model is formulated as a deep conditional variational autoencoder that samples diverse fixes for the given erroneous programs. In order to account for ambiguity and inherent lack of representative datasets, we propose a novel regularizer to encourage the model to generate diverse fixes. Our evaluations on common programming errors show for the first time the generation of diverse fixes and strong improvements over the state-of-the-art approaches by fixing up to 61% of the mistakes.

\n", "tags": ["repair", "code generation"], "tsne_embedding": [21.98748016357422, -2.4595627784729004]}, {"key": "haldar2020multiperspective", "year": "2020", "title": "A Multi-Perspective Architecture for Semantic Code Search", "abstract": "

The ability to match pieces of code to their corresponding natural language descriptions and vice versa is fundamental for natural language search interfaces to software repositories. In this paper, we propose a novel multi-perspective cross-lingual neural framework for code\u2013text matching, inspired in part by a previous model for monolingual text-to-text matching, to capture both global and local similarities. Our experiments on the CoNaLa dataset show that our proposed model yields better performance on this cross-lingual text-to-code matching task than previous approaches that map code and text to a single joint embedding space.

\n", "tags": ["search"], "tsne_embedding": [-1.177133321762085, -13.16500473022461]}, {"key": "haque2020improved", "year": "2020", "title": "Improved Automatic Summarization of Subroutines via Attention to File Context", "abstract": "

Software documentation largely consists of short, natural language summaries of the subroutines in the software. These summaries help programmers quickly understand what a subroutine does without having to read the source code him or herself. The task of writing these descriptions is called \u201csource code summarization\u201d and has been a target of research for several years. Recently, AI-based approaches have superseded older, heuristic-based approaches. Yet, to date these AI-based approaches assume that all the content needed to predict summaries is inside subroutine itself. This assumption limits performance because many subroutines cannot be understood without surrounding context. In this paper, we present an approach that models the file context of subroutines (i.e. other subroutines in the same file) and uses an attention mechanism to find words and concepts to use in summaries. We show in an experiment that our approach extends and improves several recent baselines.

\n", "tags": ["summarization"], "tsne_embedding": [-16.5307674407959, -8.915608406066895]}, {"key": "haque2022semantic", "year": "2022", "title": "Semantic Similarity Metrics for Evaluating Source Code Summarization", "abstract": "

Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries have to programmers and the simultaneously high cost of writing and maintaining documentation by hand. Current work is almost all based on machine models trained via big data input. Large datasets of examples of code and summaries of that code are used to train an e.g. encoder-decoder neural model. Then the output predictions of the model are evaluated against a set of reference summaries. The input is code not seen by the model, and the prediction is compared to a reference. The means by which a prediction is compared to a reference is essentially word overlap, calculated via a metric such as BLEU or ROUGE. The problem with using word overlap is that not all words in a sentence have the same importance, and many words have synonyms. The result is that calculated similarity may not match the perceived similarity by human readers. In this paper, we conduct an experiment to measure the degree to which various word overlap metrics correlate to human-rated similarity of predicted and reference summaries. We evaluate alternatives based on current work in semantic similarity metrics and propose recommendations for evaluation of source code summarization.

\n", "tags": ["human evaluation", "evaluation"], "tsne_embedding": [-14.834105491638184, -12.445379257202148]}, {"key": "harer2018learning", "year": "2018", "title": "Learning to Repair Software Vulnerabilities with Generative Adversarial Networks", "abstract": "

Motivated by the problem of automated repair of software vulnerabilities, we propose an adversarial learning approach that maps from one discrete source domain to another target domain without requiring paired labeled examples or source and target domains to be bijections. We demonstrate that the proposed adversarial learning approach is an effective technique for repairing software vulnerabilities, performing close to seq2seq approaches that require labeled pairs. The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation.

\n", "tags": ["repair", "code generation"], "tsne_embedding": [23.286895751953125, -1.0894263982772827]}, {"key": "hashimoto2018retrieve", "year": "2018", "title": "A Retrieve-and-Edit Framework for Predicting Structured Outputs", "abstract": "

For the task of generating complex outputs such as source code, editing existing\noutputs can be easier than generating complex outputs from scratch. With this\nmotivation, we propose an approach that first retrieves a training example based on\nthe input (e.g., natural language description) and then edits it to the desired output\n(e.g., code). Our contribution is a computationally efficient method for learning\na retrieval model that embeds the input in a task-dependent way without relying\non a hand-crafted metric or incurring the expense of jointly training the retriever\nwith the editor. Our retrieve-and-edit framework can be applied on top of any\nbase model. We show that on a new autocomplete task for GitHub Python code\nand the Hearthstone cards benchmark, retrieve-and-edit significantly boosts the\nperformance of a vanilla sequence-to-sequence model on both tasks.

\n", "tags": ["bimodal", "search", "code generation"], "tsne_embedding": [-10.932356834411621, -0.3087027072906494]}, {"key": "hata2018learning", "year": "2018", "title": "Learning to Generate Corrective Patches using Neural Machine Translation", "abstract": "

Bug fixing is generally a manually-intensive task. However, recent work has proposed the idea of automated program repair, which aims to repair (at least a subset of) bugs in different ways such as code mutation, etc. Following in the same line of work as automated bug repair, in this paper we aim to leverage past fixes to propose fixes of current/future bugs. Specifically, we propose Ratchet, a corrective patch generation system using neural machine translation. By learning corresponding pre-correction and post-correction code in past fixes with a neural sequence-to-sequence model, Ratchet is able to generate a fix code for a given bug-prone code query. We perform an empirical study with five open source projects, namely Ambari, Camel, Hadoop, Jetty and Wicket, to evaluate the effectiveness of Ratchet. Our findings show that Ratchet can generate syntactically valid statements 98.7% of the time, and achieve an F1-measure between 0.41-0.83 with respect to the actual fixes adopted in the code base. In addition, we perform a qualitative validation using 20 participants to see whether the generated statements can be helpful in correcting bugs. Our survey showed that Ratchet\u2019s output was considered to be helpful in fixing the bugs on many occasions, even if fix was not 100% correct.

\n", "tags": ["repair", "code generation"], "tsne_embedding": [19.799968719482422, 0.14845487475395203]}, {"key": "hazoom2021text", "year": "2021", "title": "Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data", "abstract": "

Most available semantic parsing datasets, comprising of pairs of natural utterances and logical forms, were collected solely for the purpose of training and evaluation of natural language understanding systems. As a result, they do not contain any of the richness and variety of natural-occurring utterances, where humans ask about data they need or are curious about. In this work, we release SEDE, a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website. We show that these pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset, propose an evaluation metric based on comparison of partial query clauses that is more suitable for real-world queries, and conduct experiments with strong baselines, showing a large gap between the performance on SEDE compared to other common datasets.

\n", "tags": ["dataset"], "tsne_embedding": [-19.604822158813477, -19.20011329650879]}, {"key": "he2019learning", "year": "2019", "title": "Learning to Fuzz from Symbolic Execution with Application to Smart Contracts", "abstract": "

Fuzzing and symbolic execution are two complementary techniques for discovering software vulnerabilities. Fuzzing is fast and scalable, but can be ineffective when it fails to randomly select the right inputs. Symbolic execution is thorough but slow and often does not scale to deep program paths with complex path conditions. In this work, we propose to learn an effective and fast fuzzer from symbolic execution, by phrasing the learning task in the framework of imitation learning. During learning, a symbolic execution expert generates a large number of quality inputs improving coverage on thousands of programs. Then, a fuzzing policy, represented with a suitable architecture of neural networks, is trained on the generated dataset. The learned policy can then be used to fuzz new programs. We instantiate our approach to the problem of fuzzing smart contracts, a domain where contracts often implement similar functionality (facilitating learning) and security is of utmost importance. We present an end-to-end system, ILF (for Imitation Learning based Fuzzer), and an extensive evaluation over >18K contracts. Our results show that ILF is effective: (i) it is fast, generating 148 transactions per second, (ii) it outperforms existing fuzzers (e.g., achieving 33% more coverage), and (iii) it detects more vulnerabilities than existing fuzzing and symbolic execution tools for Ethereum.

\n", "tags": ["fuzzing", "GNN"], "tsne_embedding": [16.381404876708984, 13.279576301574707]}, {"key": "he2021learning", "year": "2021", "title": "Learning to Find Naming Issues with Big Code and Small Supervision", "abstract": "

We introduce a new approach for finding and fixing naming\nissues in source code. The method is based on a careful\ncombination of unsupervised and supervised procedures: (i)\nunsupervised mining of patterns from Big Code that express\ncommon naming idioms. Program fragments violating such\nidioms indicates likely naming issues, and (ii) supervised\nlearning of a classifier on a small labeled dataset which filters\npotential false positives from the violations.

\n\n

We implemented our method in a system called\nNamer and evaluated it on a large number of Python and Java programs.\nWe demonstrate that Namer is effective in finding naming mistakes\nin real world repositories with high precision (\u223c70%).\nPerhaps surprisingly, we also show that existing deep learning methods\nare not practically effective and achieve low precision in finding naming issues (up to \u223c16%).

\n", "tags": ["repair"], "tsne_embedding": [12.313630104064941, -6.977391719818115]}, {"key": "he2022distribution", "year": "2022", "title": "On Distribution Shift in Learning-based Bug Detectors", "abstract": "

Deep learning has recently achieved initial success in program analysis tasks such as bug detection. Lacking real bugs, most existing works construct training and test data by injecting synthetic bugs into correct programs. Despite achieving high test accuracy (e.g. >90%), the resulting bug detectors are found to be surprisingly unusable in practice, i.e., <10% precision when used to scan real software repositories. In this work, we argue that this massive performance difference is caused by distribution shift, i.e., a fundamental mismatch between the real bug distribution and the synthetic bug distribution used to train and evaluate the detectors. To address this key challenge, we propose to train a bug detector in two phases, first on a synthetic bug distribution to adapt the model to the bug detection domain, and then on a real bug distribution to drive the model towards the real distribution. During these two phases, we leverage a multi-task hierarchy, focal loss, and contrastive learning to further boost performance. We evaluate our approach extensively on three widely studied bug types, for which we construct new datasets carefully designed to capture the real bug distribution. The results demonstrate that our approach is practically effective and successfully mitigates the distribution shift: our learned detectors are highly performant on both our constructed test set and the latest version of open source repositories.

\n", "tags": ["defect"], "tsne_embedding": [19.595779418945312, 5.000174522399902]}, {"key": "hellendoorn2015will", "year": "2015", "title": "Will they like this? Evaluating Code Contributions With Language Models", "abstract": "

Popular open-source software projects receive and\nreview contributions from a diverse array of developers, many\nof whom have little to no prior involvement with the project. A\nrecent survey reported that reviewers consider conformance to\nthe project\u2019s code style to be one of the top priorities when evaluating code contributions on Github. We propose to quantitatively\nevaluate the existence and effects of this phenomenon. To this aim\nwe use language models, which were shown to accurately capture\nstylistic aspects of code. We find that rejected changesets do\ncontain code significantly less similar to the project than accepted\nones; furthermore, the less similar changesets are more likely\nto be subject to thorough review. Armed with these results we\nfurther investigate whether new contributors learn to conform to\nthe project style and find that experience is positively correlated\nwith conformance to the project\u2019s code style.

\n", "tags": ["review", "language model"], "tsne_embedding": [-23.303054809570312, -12.451528549194336]}, {"key": "hellendoorn2017deep", "year": "2017", "title": "Are Deep Neural Networks the Best Choice for Modeling Source Code?", "abstract": "

Current statistical language modeling techniques, including deep-learning based models, have proven to be quite effective for source\ncode. We argue here that the special properties of source code can\nbe exploited for further improvements. In this work, we enhance\nestablished language modeling approaches to handle the special\nchallenges of modeling source code, such as: frequent changes,\nlarger, changing vocabularies, deeply nested scopes, etc. We present\na fast, nested language modeling toolkit specifically designed for\nsoftware, with the ability to add & remove text, and mix & swap out\nmany models. Specifically, we improve upon prior cache-modeling\nwork and present a model with a much more expansive, multi-level\nnotion of locality that we show to be well-suited for modeling\nsoftware. We present results on varying corpora in comparison\nwith traditional N -gram, as well as RNN, and LSTM deep-learning\nlanguage models, and release all our source code for public use.\nOur evaluations suggest that carefully adapting N-gram models for\nsource code can yield performance that surpasses even RNN and\nLSTM based deep-learning models.

\n", "tags": ["language model"], "tsne_embedding": [-3.680478572845459, 4.181453704833984]}, {"key": "hellendoorn2018deep", "year": "2018", "title": "Deep Learning Type Inference", "abstract": "

Dynamically typed languages such as JavaScript and Python are\nincreasingly popular, yet static typing has not been totally eclipsed:\nPython now supports type annotations and languages like TypeScript offer a middle-ground for JavaScript: a strict superset of\nJavaScript, to which it transpiles, coupled with a type system that\npermits partially typed programs. However, static typing has a cost:\nadding annotations, reading the added syntax, and wrestling with\nthe type system to fix type errors. Type inference can ease the\ntransition to more statically typed code and unlock the benefits of\nricher compile-time information, but is limited in languages like\nJavaScript as it cannot soundly handle duck-typing or runtime evaluation\nvia eval. We propose DeepTyper, a deep learning model\nthat understands which types naturally occur in certain contexts\nand relations and can provide type suggestions, which can often\nbe verified by the type checker, even if it could not infer the type\ninitially. DeepTyper, leverages an automatically aligned corpus\nof tokens and types to accurately predict thousands of variable\nand function type annotations. Furthermore, we demonstrate that\ncontext is key in accurately assigning these types and introduce a\ntechnique to reduce overfitting on local cues while highlighting the\nneed for further improvements. Finally, we show that our model\ncan interact with a compiler to provide more than 4,000 additional\ntype annotations with over 95% precision that could not be inferred\nwithout the aid of DeepTyper.

\n", "tags": ["representation", "types"], "tsne_embedding": [-2.918985605239868, 27.49640464782715]}, {"key": "hellendoorn2020global", "year": "2020", "title": "Global Relational Models of Source Code", "abstract": "

Models of code can learn distributed representations of a program\u2019s syntax and semantics to predict many non-trivial properties of a program. Recent state-of-the-art models leverage highly structured representations of programs, such as trees, graphs and paths therein (e.g. data-flow relations), which are precise and abundantly available for code. This provides a strong inductive bias towards semantically meaningful relations, yielding more generalizable representations than classical sequence-based models. Unfortunately, these models primarily rely on graph-based message passing to represent relations in code, which makes them de facto local due to the high cost of message-passing steps, quite in contrast to modern, global sequence-based models, such as the Transformer. In this work, we bridge this divide between global and structured models by introducing two new hybrid model families that are both global and incorporate structural bias: Graph Sandwiches, which wrap traditional (gated) graph message-passing layers in sequential message-passing layers; and Graph Relational Embedding Attention Transformers (GREAT for short), which bias traditional Transformers with relational information from graph edge types. By studying a popular, non-trivial program repair task, variable-misuse identification, we explore the relative merits of traditional and hybrid model families for code representation. Starting with a graph-based model that already improves upon the prior state-of-the-art for this task by 20%, we show that our proposed hybrid models improve an additional 10-15%, while training both faster and using fewer parameters.

\n", "tags": ["variable misuse", "defect", "GNN", "Transformer"], "tsne_embedding": [-1.0414292812347412, 11.484002113342285]}, {"key": "henkel2020semantic", "year": "2022", "title": "Semantic Robustness of Models of Source Code", "abstract": "

Deep neural networks are vulnerable to adversarial examples - small input perturbations that result in incorrect predictions. We study this problem for models of source code, where we want the neural network to be robust to source-code modifications that preserve code functionality. To facilitate training robust models, we define a powerful and generic adversary that can employ sequences of parametric, semantics-preserving program transformations. We then explore how, with such an adversary, one can train models that are robust to adversarial program transformations. We conduct a thorough evaluation of our approach and find several surprising facts: we find robust training to beat dataset augmentation in every evaluation we performed; we find that a state-of-the-art architecture (code2seq) for models of code is harder to make robust than a simpler baseline; additionally, we find code2seq to have surprising weaknesses not present in our simpler baseline model; finally, we find that robust models perform better against unseen data from different sources (as one might hope) - however, we also find that robust models are not clearly better in the cross-language transfer task. To the best of our knowledge, we are the first to study the interplay between robustness of models of code and the domain-adaptation and cross-language transfer tasks.

\n", "tags": ["adversarial", "naming"], "tsne_embedding": [9.893157958984375, 21.294607162475586]}, {"key": "heyman2020neural", "year": "2020", "title": "Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent", "abstract": "

In this work, we propose and study annotated code search: the retrieval of code snippets paired with brief descriptions of their intent using natural language queries. On three benchmark datasets, we investigate how code retrieval systems can be improved by leveraging descriptions to better capture the intents of code snippets. Building on recent progress in transfer learning and natural language processing, we create a domain-specific retrieval model for code annotated with a natural language description. We find that our model yields significantly more relevant search results (with absolute gains up to 20.6% in mean reciprocal rank) compared to state-of-the-art code retrieval methods that do not use descriptions but attempt to compute the intent of snippets solely from unannotated code.

\n", "tags": ["search"], "tsne_embedding": [-3.5553314685821533, -14.014030456542969]}, {"key": "hindle2012naturalness", "year": "2012", "title": "On the Naturalness of Software", "abstract": "

Natural languages like English are rich, complex,\nand powerful. The highly creative and graceful use of languages\nlike English and Tamil, by masters like Shakespeare and\nAvvaiyar, can certainly delight and inspire. But in practice,\ngiven cognitive constraints and the exigencies of daily life, most\nhuman utterances are far simpler and much more repetitive\nand predictable. In fact, these utterances can be very usefully\nmodeled using modern statistical methods. This fact has led\nto the phenomenal success of statistical approaches to speech\nrecognition, natural language translation, question-answering,\nand text mining and comprehension.

\n\n

We begin with the conjecture that most software is also\nnatural, in the sense that it is created by humans at work,\nwith all the attendant constraints and limitations\u2014and thus,\nlike natural language, it is also likely to be repetitive and\npredictable. We then proceed to ask whether a) code can\nbe usefully modeled by statistical language models and b)\nsuch models can be leveraged to support software engineers.\nUsing the widely adopted n-gram model, we provide empirical\nevidence supportive of a positive answer to both these questions.\nWe show that code is also very repetitive, and in fact even more\nso than natural languages. As an example use of the model,\nwe have developed a simple code completion engine for Java\nthat, despite its simplicity, already improves Eclipse\u2019s built-in\ncompletion capability. We conclude the paper by laying out a\nvision for future research in this area.

\n\n", "tags": ["language model", "autocomplete"], "tsne_embedding": [-14.641676902770996, -17.56171417236328]}, {"key": "hoang2020cc2vec", "year": "2020", "title": "CC2Vec: Distributed Representations of Code Changes", "abstract": "

Existing work on software patches often use features specific to a single task. These works often rely on manually identified features, and human effort is required to identify these features for each task. In this work, we propose CC2Vec, a neural network model that learns a representation of code changes guided by their accompanying log messages, which represent the semantic intent of the code changes. CC2Vec models the hierarchical structure of a code change with the help of the attention mechanism and uses multiple comparison functions to identify the differences between the removed and added code.

\n\n

To evaluate if CC2Vec can produce a distributed representation of code changes that is general and useful for multiple tasks on software patches, we use the vectors produced by CC2Vec for three tasks: log message generation, bug fixing patch identification, and just-in-time defect prediction. In all tasks, the models using CC2Vec outperform the state-of-the-art techniques.

\n", "tags": ["edit"], "tsne_embedding": [-13.873098373413086, 4.35662841796875]}, {"key": "hong2021fix", "year": "2021", "title": "Fix-Filter-Fix: Intuitively Connect Any Models for Effective Bug Fixing", "abstract": "

Locating and fixing bugs is a time-consuming task. Most neural machine translation (NMT) based approaches for automatically bug fixing lack generality and do not make full use of the rich information in the source code. In NMT-based bug fixing, we find some predicted code identical to the input buggy code (called unchanged fix) in NMT-based approaches due to high similarity between buggy and fixed code (e.g., the difference may only appear in one particular line). Obviously, unchanged fix is not the correct fix because it is the same as the buggy code that needs to be fixed. Based on these, we propose an intuitive yet effective general framework (called Fix-Filter-Fix or F\u02c63) for bug fixing. F\u02c63 connects models with our filter mechanism to filter out the last model\u2019s unchanged fix to the next. We propose an F\u02c63 theory that can quantitatively and accurately calculate the F\u02c63 lifting effect. To evaluate, we implement the Seq2Seq Transformer (ST) and the AST2Seq Transformer (AT) to form some basic F\u02c63 instances, called F\u02c63_ST+AT and F\u02c63_AT+ST. Comparing them with single model approaches and many model connection baselines across four datasets validates the effectiveness and generality of F\u02c63 and corroborates our findings and methodology.

\n", "tags": ["repair"], "tsne_embedding": [19.0999698638916, -0.32820484042167664]}, {"key": "hsiao2014using", "year": "2014", "title": "Using Web Corpus Statistics for Program Analysis", "abstract": "

Several program analysis tools\u2014such as plagiarism detection and bug finding\u2014rely on knowing a piece of code\u2019s\nrelative semantic importance. For example, a plagiarism detector should not bother reporting two programs that have\nan identical simple loop counter test, but should report programs that share more distinctive code. Traditional program\nanalysis techniques (e.g., finding data and control dependencies) are useful, but do not say how surprising or common\na line of code is. Natural language processing researchers\nhave encountered a similar problem and addressed it using\nan n-gram model of text frequency, derived from statistics\ncomputed over text corpora.

\n\n

We propose and compute an n-gram model for programming languages, computed over a corpus of 2.8 million\nJavaScript programs we downloaded from the Web. In contrast to previous techniques, we describe a code n-gram as\na subgraph of the program dependence graph that contains\nall nodes and edges reachable in n steps from the statement.\nWe can count n-grams in a program and count the frequency\nof n-grams in the corpus, enabling us to compute tf-idf-style\nmeasures that capture the differing importance of different\nlines of code. We demonstrate the power of this approach by\nimplementing a plagiarism detector with accuracy that beats\nprevious techniques, and a bug-finding tool that discovered\nover a dozen previously unknown bugs in a collection of real\ndeployed programs.

\n", "tags": ["defect"], "tsne_embedding": [8.75256061553955, -10.851241111755371]}, {"key": "hu2017codesum", "year": "2017", "title": "CodeSum: Translate Program Language to Natural Language", "abstract": "

During software maintenance, programmers spend a lot of time on code comprehension. Reading comments is an effective way for programmers to reduce the reading and navigating time when comprehending source code. Therefore, as a critical task in software engineering, code summarization aims to generate brief natural language descriptions for source code. In this paper, we propose a new code summarization model named CodeSum. CodeSum exploits the attention-based sequence-to-sequence (Seq2Seq) neural network with Structure-based Traversal (SBT) of Abstract Syntax Trees (AST). The AST sequences generated by SBT can better present the structure of ASTs and keep unambiguous. We conduct experiments on three large-scale corpora in different program languages, i.e., Java, C#, and SQL, in which Java corpus is our new proposed industry code extracted from Github. Experimental results show that our method CodeSum outperforms the state-of-the-art significantly.

\n", "tags": ["bimodal", "summarization"], "tsne_embedding": [-15.221452713012695, -6.558513641357422]}, {"key": "huang2021cosqa", "year": "2021", "title": "CoSQA: 20,000+ Web Queries for Code Search and Question Answering", "abstract": "

Finding codes given natural language query is beneficial to the productivity of software developers.\nFuture progress towards better semantic matching between query and code requires richer supervised training resources.\nTo remedy this, we introduce the CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and codes,\neach annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance query-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%, and incorporating CoCLR brings a further improvement of 10.5%.

\n", "tags": ["dataset", "search"], "tsne_embedding": [-4.904903411865234, -13.608297348022461]}, {"key": "husain2019codesearchnet", "year": "2019", "title": "CodeSearchNet Challenge: Evaluating the State of Semantic Code Search", "abstract": "

Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas.

\n\n

To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task.

\n\n

We hope that CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further and will host a competition and leaderboard to track the progress on the challenge. We are also keen on extending CodeSearchNet Challenge to more queries and programming languages in the future.

\n", "tags": ["dataset", "search"], "tsne_embedding": [-5.453479290008545, -14.941607475280762]}, {"key": "hussain2019deep", "year": "2019", "title": "Deep Transfer Learning for Source Code Modeling", "abstract": "

In recent years, deep learning models have shown great potential in source code modeling and analysis. Generally, deep learning-based approaches are problem-specific and data-hungry. A challenging issue of these approaches is that they require training from starch for a different related problem. In this work, we propose a transfer learning-based approach that significantly improves the performance of deep learning-based source code models. In contrast to traditional learning paradigms, transfer learning can transfer the knowledge learned in solving one problem into another related problem. First, we present two recurrent neural network-based models RNN and GRU for the purpose of transfer learning in the domain of source code modeling. Next, via transfer learning, these pre-trained (RNN and GRU) models are used as feature extractors. Then, these extracted features are combined into attention learner for different downstream tasks. The attention learner leverages from the learned knowledge of pre-trained models and fine-tunes them for a specific downstream task. We evaluate the performance of the proposed approach with extensive experiments with the source code suggestion task. The results indicate that the proposed approach outperforms the state-of-the-art models in terms of accuracy, precision, recall, and F-measure without training the models from scratch.

\n", "tags": ["pretraining"], "tsne_embedding": [-7.455807209014893, -1.6913701295852661]}, {"key": "iyer2016summarizing", "year": "2016", "title": "Summarizing Source Code using a Neural Attention Model", "abstract": "

High quality source code is often paired\nwith high level summaries of the computation it performs, for example in code\ndocumentation or in descriptions posted\nin online forums. Such summaries are\nextremely useful for applications such as\ncode search but are expensive to manually\nauthor, hence only done for a small fraction of all code that is produced. In this\npaper, we present the first completely data-driven approach for generating high level\nsummaries of source code. Our model,\nCODE-NN , uses Long Short Term Memory (LSTM) networks with attention to\nproduce sentences that describe C# code\nsnippets and SQL queries. CODE-NN\nis trained on a new corpus that is automatically collected from StackOverflow,\nwhich we release. Experiments demonstrate strong performance on two tasks:\n(1) code summarization, where we establish the first end-to-end learning results\nand outperform strong baselines, and (2)\ncode retrieval, where our learned model\nimproves the state of the art on a recently\nintroduced C# benchmark by a large margin.

\n", "tags": ["summarization", "bimodal"], "tsne_embedding": [-15.059619903564453, -7.80015754699707]}, {"key": "iyer2018mapping", "year": "2018", "title": "Mapping Language to Code in Programmatic Context", "abstract": "

Source code is rarely written in isolation. It depends significantly on the programmatic context, such as the class that the code would reside in. To study this phenomenon, we introduce the task of generating class member functions given English documentation and the programmatic context provided by the rest of the class. This task is challenging because the desired code can vary greatly depending on the functionality the class provides (e.g., a sort function may or may not be available when we are asked to \u201creturn the smallest element\u201d in a particular member variable list). We introduce CONCODE, a new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develop a new encoder-decoder architecture that models the interaction between the method documentation and the class environment. We also present a detailed error analysis suggesting that there is significant room for future work on this task.

\n", "tags": ["bimodal", "code generation"], "tsne_embedding": [-12.75064754486084, -20.8382568359375]}, {"key": "iyer2019learning", "year": "2019", "title": "Learning Programmatic Idioms for Scalable Semantic Parsing", "abstract": "

Programmers typically organize executable source code using high-level coding patterns or idiomatic structures such as nested loops, exception handlers and recursive blocks, rather than as individual code tokens. In contrast, state of the art semantic parsers still map natural language instructions to source code by building the code syntax tree one node at a time. In this paper, we introduce an iterative method to extract code idioms from large source code corpora by repeatedly collapsing most-frequent depth-2 subtrees of their syntax trees, and we train semantic parsers to apply these idioms during decoding. We apply this idiom-based code generation to a recent context-dependent semantic parsing task, and improve the state of the art by 2.2% BLEU score while reducing training time by more than 50%. This improved speed enables us to scale up the model by training on an extended training set that is 5x times larger, to further move up the state of the art by an additional 2.3% BLEU and 0.9% exact match.

\n", "tags": ["pattern mining", "code generation", "grammar"], "tsne_embedding": [11.25457763671875, -14.221555709838867]}, {"key": "jain2020contrastive", "year": "2020", "title": "Contrastive Code Representation Learning", "abstract": "

Machine-aided programming tools such as type predictors and code summarizers\nare increasingly learning-based. However, most code representation learning approaches rely on supervised learning with task-specific annotated datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised\nalgorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, relying only on\nthe raw text of programs. In particular, we design an unsupervised pretext task by\ngenerating textually divergent copies of source functions via automated source-tosource compiler transforms that preserve semantics. We train a neural model to\nidentify variants of an anchor program within a large batch of negatives. To solve\nthis task, the network must extract program features representing the functionality,\nnot form, of the program. This is the first application of instance discrimination\nto code representation learning to our knowledge. We pre-train models over 1.8m\nunannotated JavaScript methods mined from GitHub. ContraCode pre-training\nimproves code summarization accuracy by 7.9% over supervised approaches and\n4.8% over RoBERTa pre-training. Moreover, our approach is agnostic to model architecture; for a type inference task, contrastive pre-training consistently improves\nthe accuracy of existing baselines.

\n", "tags": ["representation", "pretraining"], "tsne_embedding": [-5.008065700531006, -0.3335019648075104]}, {"key": "jayasundara2019treecaps", "year": "2019", "title": "TreeCaps: Tree-Structured Capsule Networks for Program Source Code Processing", "abstract": "

Program comprehension is a fundamental task in software development and maintenance processes. Software developers often need to understand a large amount of existing code before they can develop new features or fix bugs in existing programs. Being able to process programming language code automatically and provide summaries of code functionality accurately can significantly help developers to reduce time spent in code navigation and understanding, and thus increase productivity. Different from natural language articles, source code in programming languages often follows rigid syntactical structures and there can exist dependencies among code elements that are located far away from each other through complex control flows and data flows. Existing studies on tree-based convolutional neural networks (TBCNN) and gated graph neural networks (GGNN) are not able to capture essential semantic dependencies among code elements accurately. In this paper, we propose novel tree-based capsule networks (TreeCaps) and relevant techniques for processing program code in an automated way that encodes code syntactical structures and captures code dependencies more accurately. Based on evaluation on programs written in different programming languages, we show that our TreeCaps-based approach can outperform other approaches in classifying the functionalities of many programs.

\n", "tags": ["representation"], "tsne_embedding": [-5.802390098571777, 13.461023330688477]}, {"key": "jesse2021learning", "year": "2021", "title": "Learning Type Annotation: Is Big Data Enough?", "abstract": "

TypeScript is a widely used optionally-typed language where developers can adopt \u201cpay as you go\u201d typing: they can add types as\ndesired, and benefit from static typing. The \u201ctype annotation tax\u201d\nor manual effort required to annotate new or existing TypeScript\ncan be reduced by a variety of automatic methods. Probabilistic\nmachine-learning (ML) approaches work quite well. ML approaches\nuse different inductive biases, ranging from simple token sequences\nto complex graphical neural network (GNN) models capturing syntax and semantic relations. More sophisticated inductive biases are\nhand-engineered to exploit the formal nature of software. Rather\nthan deploying fancy inductive biases for code, can we just use \u201cbig\ndata\u201d to learn natural patterns relevant to typing? We find evidence\nsuggesting that this is the case. We present TypeBert, demonstrating that even with simple token-sequence inductive bias used in\nBERT-style models and enough data, type-annotation performance\nof the most sophisticated models can be surpassed.

\n", "tags": ["Transformer", "types"], "tsne_embedding": [-3.9114649295806885, 27.69058609008789]}, {"key": "jesse2022learning", "year": "2022", "title": "Learning To Predict User-Defined Types", "abstract": "

TypeScript is a widely adopted gradual typed language where developers can optionally type variables, functions, parameters and more. Probabilistic type inference approaches with ML (machine learning) work well especially for commonly occurring types such as boolean, number, and string. TypeScript permits a wide range of types including developer defined class names and type interfaces. These developer defined types, termed user-defined types, can be written within the realm of language naming conventions. The set of user-defined types is boundless and existing bounded type guessing approaches are an imperfect solution. Existing works either under perform in user-defined types or ignore user-defined types altogether. This work leverages a BERT-style pre-trained model, with multi-task learning objectives, to learn how to type user-defined classes and interfaces. Thus we present DIVERSETYPER, a solution that explores the diverse set of user-defined types by uniquely aligning classes and interfaces declarations to the places in which they are used. DIVERSETYPER surpasses all existing works including those that model user-defined types.

\n", "tags": ["Transformer", "types"], "tsne_embedding": [-4.122315883636475, 27.036470413208008]}, {"key": "jesse2023large", "year": "2023", "title": "Large Language Models and Simple, Stupid Bugs", "abstract": "

With the advent of powerful neural language models, AI-based systems to assist developers in coding tasks are becoming widely available; Copilot is one such system. Copilot uses Codex, a large language model (LLM), to complete code conditioned on a preceding \u201cprompt\u201d. Codex, however, is trained on public GitHub repositories, viz., on code that may include bugs and vulnerabilities. Previous studies [1], [2] show Codex reproduces vulnerabilities seen in training. In this study, we examine how prone Codex is to generate an interesting bug category, single statement bugs, commonly referred to as simple, stupid bugs or SStuBs in the MSR community. We find that Codex and similar LLMs do help avoid some SStuBs, but do produce known, verbatim SStuBs as much as 2x as likely than known, verbatim correct code. We explore the consequences of the Codex generated SStuBs and propose avoidance strategies that suggest the possibility of reducing the production of known, verbatim SStubs, and increase the possibility of producing known, verbatim fixes.

\n", "tags": ["Transformer", "defect"], "tsne_embedding": [12.512516021728516, 5.806422710418701]}, {"key": "jian2021multimodal", "year": "2021", "title": "Multimodal Representation for Neural Code Search", "abstract": "

Semantic code search is about finding semantically relevant code snippets for a given natural language query. In the state-of-the-art approaches, the semantic similarity between code and query is quantified as the distance of their representation in the shared vector space. In this paper, to improve the vector space, we introduce tree-serialization methods on a simplified form of AST and build the multimodal representation for the code data. We conduct extensive experiments using a single corpus that is large-scale and multi-language: CodeSearchNet. Our results show that both our tree-serialized representations and multimodal learning model improve the performance of code search. Last, we define intuitive quantification metrics oriented to the completeness of semantic and syntactic information of the code data, to help understand the experimental findings.

\n", "tags": ["search", "representation"], "tsne_embedding": [-2.3492956161499023, -14.955284118652344]}, {"key": "jian2022assemble", "year": "2022", "title": "Assemble Foundation Models for Automatic Code Summarization", "abstract": "

Automatic code summarization is beneficial to software development and maintenance since it reduces the burden of manual tasks. Currently, artificial intelligence is undergoing a paradigm shift. The foundation models pretrained on massive data and finetuned to downstream tasks surpass specially customized models. This trend inspired us to consider reusing foundation models instead of learning from scratch. Based on this, we propose a flexible and robust approach for automatic code summarization based on neural networks. We assemble available foundation models, such as CodeBERT and GPT-2, into a single model named AdaMo. Moreover, we utilize Gaussian noise as the simulation of contextual information to optimize the latent representation. Furthermore, we introduce two adaptive schemes from the perspective of knowledge transfer, namely continuous pretraining and intermediate finetuning, and design intermediate stage tasks for general sequence-to-sequence learning. Finally, we evaluate AdaMo against a benchmark dataset for code summarization, by comparing it with state-of-the-art models.

\n", "tags": ["summarization", "documentation", "language model"], "tsne_embedding": [-16.68464469909668, -5.834439277648926]}, {"key": "jiang2017automatically", "year": "2017", "title": "Automatically Generating Commit Messages from Diffs using Neural Machine Translation", "abstract": "

Commit messages are a valuable resource in comprehension of software evolution, since they provide a record of changes such as feature additions and bug repairs. Unfortunately, programmers often neglect to write good commit messages. Different techniques have been proposed to help programmers by automatically writing these messages. These techniques are effective at describing what changed, but are often verbose and lack context for understanding the rationale behind a change. In contrast, humans write messages that are short and summarize the high level rationale. In this paper, we adapt Neural Machine Translation (NMT) to automatically \u201ctranslate\u201d diffs into commit messages. We trained an NMT algorithm using a corpus of diffs and human-written commit messages from the top 1k Github projects. We designed a filter to help ensure that we only trained the algorithm on higher-quality commit messages. Our evaluation uncovered a pattern in which the messages we generate tend to be either very high or very low quality. Therefore, we created a quality-assurance filter to detect cases in which we are unable to produce good messages, and return a warning instead.

\n", "tags": ["edit", "bimodal"], "tsne_embedding": [-16.974000930786133, 4.264140605926514]}, {"key": "jiang2021treebert", "year": "2021", "title": "TreeBERT: A Tree-Based Pre-Trained Model for Programming Language", "abstract": "

Source code can be parsed into the abstract syntax tree (AST) based on defined syntax rules. However, in pre-training, little work has considered the incorporation of tree structure into the learning process. In this paper, we present TreeBERT, a tree-based pre-trained model for improving programming language-oriented generation tasks. To utilize tree structure, TreeBERT represents the AST corresponding to the code as a set of composition paths and introduces node position embedding. The model is trained by tree masked language modeling (TMLM) and node order prediction (NOP) with a hybrid objective. TMLM uses a novel masking strategy designed according to the tree\u2019s characteristics to help the model understand the AST and infer the missing semantics of the AST. With NOP, TreeBERT extracts the syntactical structure by learning the order constraints of nodes in AST. We pre-trained TreeBERT on datasets covering multiple programming languages. On code summarization and code documentation tasks, TreeBERT outperforms other pre-trained models and state-of-the-art models designed for these tasks. Furthermore, TreeBERT performs well when transferred to the pre-trained unseen programming language.

\n", "tags": ["grammar", "Transformer"], "tsne_embedding": [-12.093164443969727, -5.551645278930664]}, {"key": "johnson2020learning", "year": "2020", "title": "Learning Graph Structure With A Finite-State Automaton Layer", "abstract": "

Graph-based neural network models are producing strong results in a number of domains, in part because graphs provide flexibility to encode domain knowledge in the form of relational structure (edges) between nodes in the graph. In practice, edges are used both to represent intrinsic structure (e.g., abstract syntax trees of programs) and more abstract relations that aid reasoning for a downstream task (e.g., results of relevant program analyses). In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. Motivated by their power in program analyses, we consider relations defined by paths on the base graph accepted by a finite-state automaton. We show how to learn these relations end-to-end by relaxing the problem into learning finite-state automata policies on a graph-based POMDP and then training these policies using implicit differentiation. The result is a differentiable Graph Finite-State Automaton (GFSA) layer that adds a new edge type (expressed as a weighted adjacency matrix) to a base graph. We demonstrate that this layer can find shortcuts in grid-world graphs and reproduce simple static analyses on Python programs. Additionally, we combine the GFSA layer with a larger graph-based model trained end-to-end on the variable misuse program understanding task, and find that using the GFSA layer leads to better performance than using hand-engineered semantic edges or other baseline methods for adding learned edge types.

\n", "tags": ["GNN", "program analysis"], "tsne_embedding": [-1.1242722272872925, 13.185342788696289]}, {"key": "jung2021commitbert", "year": "2021", "title": "CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model", "abstract": "

Commit message is a document that summarizes source code changes in natural language. A good commit message clearly shows the source code changes, so this enhances collaboration between developers. Therefore, our work is to develop a model that automatically writes the commit message. To this end, we release 345K datasets consisting of code modification and commit messages in six programming languages (Python, PHP, Go, Java, JavaScript, and Ruby). Similar to the neural machine translation (NMT) model, using our dataset, we feed the code modification to the encoder input and the commit message to the decoder input and measure the result of the generated commit message with BLEU-4. Also, we propose the following two training methods to improve the result of generating the commit message: (1) A method of preprocessing the input to feed the code modification to the encoder input. (2) A method that uses an initial weight suitable for the code domain to reduce the gap in contextual representation between programming language (PL) and natural language (NL).

\n", "tags": ["dataset", "language model", "Transformer"], "tsne_embedding": [-16.088560104370117, 3.4334723949432373]}, {"key": "kacmajor2019automatic", "year": "2019", "title": "Automatic Acquisition of Annotated Training Corpora for Test-Code Generation", "abstract": "

Open software repositories make large amounts of source code publicly available. Potentially, this source code could be used as training data to develop new, machine learning-based programming tools. For many applications, however, raw code scraped from online repositories does not constitute an adequate training dataset. Building on the recent and rapid improvements in machine translation (MT), one possibly very interesting application is code generation from natural language descriptions. One of the bottlenecks in developing these MT-inspired systems is the acquisition of parallel text-code corpora required for training code-generative models. This paper addresses the problem of automatically synthetizing parallel text-code corpora in the software testing domain. Our approach is based on the observation that self-documentation through descriptive method names is widely adopted in test automation, in particular for unit testing. Therefore, we propose synthesizing parallel corpora comprised of parsed test function names serving as code descriptions, aligned with the corresponding function bodies. We present the results of applying one of the state-of-the-art MT methods on such a generated dataset. Our experiments show that a neural MT model trained on our dataset can generate syntactically correct and semantically relevant short Java functions from quasi-natural language descriptions of functionality.

\n", "tags": [], "tsne_embedding": [-8.124792098999023, -9.005061149597168]}, {"key": "kanade2020pretrained", "year": "2020", "title": "Pre-trained Contextual Embedding of Source Code", "abstract": "

The source code of a program not only serves as a formal description of an executable task, but it also serves to communicate developer intent in a human-readable form. To facilitate this, developers use meaningful identifier names and natural-language documentation. This makes it possible to successfully apply sequence-modeling approaches, shown to be effective in natural-language processing, to source code. A major advancement in natural-language understanding has been the use of pre-trained token embeddings; BERT and other works have further shown that pre-trained contextual embeddings can be extremely powerful and can be fine-tuned effectively for a variety of downstream supervised tasks. Inspired by these developments, we present the first attempt to replicate this success on source code. We curate a massive corpus of Python programs from GitHub to pre-train a BERT model, which we call Code Understanding BERT (CuBERT). We also pre-train Word2Vec embeddings on the same dataset. We create a benchmark of five classification tasks and compare fine-tuned CuBERT against sequence models trained with and without the Word2Vec embeddings. Our results show that CuBERT outperforms the baseline methods by a margin of 2.9-22%. We also show its superiority when fine-tuned with smaller datasets, and over fewer epochs. We further evaluate CuBERT\u2019s effectiveness on a joint classification, localization and repair task involving prediction of two pointers.

\n", "tags": ["pretraining"], "tsne_embedding": [-6.6061692237854, -4.2834930419921875]}, {"key": "karaivanov2014phrase", "year": "2014", "title": "Phrase-Based Statistical Translation of Programming Languages", "abstract": "

Phrase-based statistical machine translation approaches have been\nhighly successful in translating between natural languages and are\nheavily used by commercial systems (e.g. Google Translate).

\n\n

The main objective of this work is to investigate the applicability of\nthese approaches for translating between programming languages.\nTowards that, we investigated several variants of the phrase-based\ntranslation approach: i) a direct application of the approach to\nprogramming languages, ii) a novel modification of the approach\nto incorporate the grammatical structure of the target programming\nlanguage (so to avoid generating target programs which do not\nparse), and iii) a combination of ii) with custom rules added to\nimprove the quality of the translation.

\n\n

To experiment with the above systems, we investigated machine\ntranslation from C# to Java. For the training, which takes about\n60 hours, we used a parallel corpus of 20, 499 C#-to-Java method\ntranslations. We then evaluated each of the three systems above by\ntranslating 1,000 C# methods. Our experimental results indicate\nthat with the most advanced system, about 60% of the translated\nmethods compile (the top ranked) and out of a random sample of 50\ncorrectly compiled methods, 68% (34 methods) were semantically\nequivalent to the reference solution.

\n", "tags": ["migration", "code generation"], "tsne_embedding": [4.153965473175049, -21.851945877075195]}, {"key": "karampatsis2019deep", "year": "2019", "title": "Maybe Deep Neural Networks are the Best Choice for Modeling Source Code", "abstract": "

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. But traditional language models limit the vocabulary to a fixed set of common words. For code, this strong assumption has been shown to have a significant negative effect on predictive performance. But the open vocabulary version of the neural network language models for code have not been introduced in the literature. We present a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names. We employ a segmentation into subword units, subsequences of tokens chosen based on a compression criterion, following previous work in machine translation. Our network achieves best in class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code. Furthermore, we present a simple method for dynamically adapting the model to a new test project, resulting in increased performance. We showcase our methodology on code corpora in three different languages of over a billion tokens each, hundreds of times larger than in previous work. To our knowledge, this is the largest neural language model for code that has been reported.

\n", "tags": ["language model"], "tsne_embedding": [-3.9433157444000244, 3.6892874240875244]}, {"key": "karampatsis2020big", "year": "2020", "title": "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code", "abstract": "

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.

\n", "tags": ["language model"], "tsne_embedding": [-3.595381498336792, 3.1005828380584717]}, {"key": "karampatsis2020scelmo", "year": "2020", "title": "SCELMo: Source Code Embeddings from Language Models", "abstract": "

Continuous embeddings of tokens in computer programs have been used to support a variety of software development tools, including readability, code search, and program repair. Contextual embeddings are common in natural language processing but have not been previously applied in software engineering. We introduce a new set of deep contextualized word representations for computer programs based on language models. We train a set of embeddings using the ELMo (embeddings from language models) framework of Peters et al (2018). We investigate whether these embeddings are effective when fine-tuned for the downstream task of bug detection. We show that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection.

\n", "tags": ["pretraining", "defect"], "tsne_embedding": [15.351909637451172, 1.879938006401062]}, {"key": "karmakar2021what", "year": "2021", "title": "What do pre-trained code models know about code?", "abstract": "

Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation, code summarization, among others. However, whether the vector representations from these pre-trained models comprehensively encode characteristics of source code well enough to be applicable to a broad spectrum of downstream tasks remains an open question.

\n\n

One way to investigate this is with diagnostic tasks called probes. In this paper, we construct four probing tasks (probing for surface-level, syntactic, structural, and semantic information) for pre-trained code models. We show how probes can be used to identify whether models are deficient in (understanding) certain code properties, characterize different model layers, and get insight into the model sample-efficiency.

\n\n

We probe four models that vary in their expected knowledge of code properties: BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code, and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow). While GraphCodeBERT performs more consistently overall, we find that BERT performs surprisingly well on some code tasks, which calls for further investigation.

\n", "tags": ["Transformer"], "tsne_embedding": [-3.6164495944976807, -3.676973581314087]}, {"key": "karmakar2022jemma", "year": "2022", "title": "JEMMA: An Extensible Java Dataset for ML4Code Applications", "abstract": "

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code\u2019s richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.

\n", "tags": ["dataset"], "tsne_embedding": [1.4681462049484253, -4.649317741394043]}, {"key": "karpathy2015visualizing", "year": "2015", "title": "Visualizing and Understanding Recurrent Networks", "abstract": "

Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful\napplications in a wide range of machine learning problems that involve sequential\ndata. However, while LSTMs provide exceptional results in practice, the source\nof their performance and their limitations remain rather poorly understood. Using character-level language models as an interpretable testbed, we aim to bridge\nthis gap by providing an analysis of their representations, predictions and error\ntypes. In particular, our experiments reveal the existence of interpretable cells that\nkeep track of long-range dependencies such as line lengths, quotes and brackets.\nMoreover, our comparative analysis with finite horizon n-gram models traces the\nsource of the LSTM improvements to long-range structural dependencies. Finally,\nwe provide analysis of the remaining errors and suggests areas for further study.

\n\n", "tags": ["language model", "code generation"], "tsne_embedding": [-23.728717803955078, 3.430269479751587]}, {"key": "katz2019towards", "year": "2019", "title": "Towards Neural Decompilation", "abstract": "

We address the problem of automatic decompilation, converting a program in low-level representation back to a higher-level human-readable programming language. The problem of decompilation is extremely important for security researchers. Finding vulnerabilities and understanding how malware operates is much easier when done over source code.

\n\n

The importance of decompilation has motivated the construction of hand-crafted rule-based decompilers. Such decompilers have been designed by experts to detect specific control-flow structures and idioms in low-level code and lift them to source level. The cost of supporting additional languages or new language features in these models is very high.

\n\n

We present a novel approach to decompilation based on neural machine translation. The main idea is to automatically learn a decompiler from a given compiler. Given a compiler from a source language S to a target language T , our approach automatically trains a decompiler that can translate (decompile) T back to S . We used our framework to decompile both LLVM IR and x86 assembly to C code with high success rates. Using our LLVM and x86 instantiations, we were able to successfully decompile over 97% and 88% of our benchmarks respectively.

\n", "tags": ["decompilation"], "tsne_embedding": [12.732939720153809, 14.739673614501953]}, {"key": "key2022speak", "year": "2022", "title": "I Speak, You Verify: Toward Trustworthy Neural Program Synthesis", "abstract": "

We develop an approach for improving the trustworthiness and overall accuracy of program synthesizers based on large language models for source code. Given a natural language description of a programming problem, our method samples both candidate programs as well as candidate predicates specifying how the program should behave. We learn to analyze the agreement between programs and predicates to judge both which program is most likely to be correct, and also judge whether the language model is able to solve the programming problem in the first place. This latter capacity allows favoring high precision over broad recall: fostering trust by only proposing a program when the system is certain that it is correct.

\n", "tags": ["synthesis"], "tsne_embedding": [9.666504859924316, 13.692703247070312]}, {"key": "kharkar2022learning", "year": "2022", "title": "Learning to Reduce False Positives in Analytic Bug Detectors", "abstract": "

Due to increasingly complex software design and rapid iterative development, code defects and security vulnerabilities are prevalent in modern software. In response, programmers rely on static analysis tools to regularly scan their codebases and find potential bugs. In order to maximize coverage, however, these tools generally tend to report a significant number of false positives, requiring developers to manually verify each warning. To address this problem, we propose a Transformer-based learning approach to identify false positive bug warnings. We demonstrate that our models can improve the precision of static analysis by 17.5%. In addition, we validated the generalizability of this approach across two major bug types: null dereference and resource leak.

\n", "tags": ["Transformer", "static analysis"], "tsne_embedding": [18.604122161865234, 7.6033244132995605]}, {"key": "kim2020code", "year": "2020", "title": "Code Prediction by Feeding Trees to Transformers", "abstract": "

In this paper, we describe how to leverage Transformer, a recent neural architecture for learning from sequential data (such as text), for code completion. As in the realm of natural language processing, Transformers surpass the prediction accuracy achievable by RNNs; we provide an experimental confirmation of this over a Python dataset.

\n\n

Furthermore, we show that the way to obtain even better accuracy from Transformers is to expose the syntactic structure of code, which is easily recovered by parsing, to the neural network. This works significantly better than presenting the code as a linear token sequence, which is how Transformers were originally intended to be used.

\n\n

To accomplish this, we propose a novel enhancement to the self-attention mechanism of the Transformer. We enable the mechanism to learn weights\u2014that is, how much to focus on each preceding token in the input\u2014not only on the basis of a token\u2019s value, but also on the basis of the spatial relationships, as in their positions in the abstract syntax tree, between each pair of tokens.

\n\n

We provide comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a Python corpus internal to Facebook.

\n", "tags": ["autocomplete"], "tsne_embedding": [-10.441862106323242, 3.4941720962524414]}, {"key": "koc2017learning", "year": "2017", "title": "Learning a Classifier for False Positive Error Reports Emitted by Static Code Analysis Tools", "abstract": "

The large scale and high complexity of modern software systems\nmake perfectly precise static code analysis (SCA) infeasible. Therefore SCA tools often over-approximate, so not to miss any real\nproblems. This, however, comes at the expense of raising false\nalarms, which, in practice, reduces the usability of these tools.

\n\n

To partially address this problem, we propose a novel learning\nprocess whose goal is to discover program structures that cause\na given SCA tool to emit false error reports, and then to use this\ninformation to predict whether a new error report is likely to be a\nfalse positive as well. To do this, we first preprocess code to isolate\nthe locations that are related to the error report. Then, we apply\nmachine learning techniques to the preprocessed code to discover\ncorrelations and to learn a classifier.

\n\n

We evaluated this approach in an initial case study of a widely-used SCA tool for Java. Our results showed that for our dataset\nwe could accurately classify a large majority of false positive error\nreports. Moreover, we identified some common coding patterns that\nled to false positive errors. We believe that SCA developers may be\nable to redesign their methods to address these patterns and reduce\nfalse positive error reports.

\n", "tags": ["static analysis"], "tsne_embedding": [18.15314483642578, 7.324861526489258]}, {"key": "kocetkov2022stack", "year": "2022", "title": "The Stack: 3TB of permissively licensed source code", "abstract": "

Large Language Models (LLMs) play an ever-increasing role in the field of\nArtificial Intelligence (AI)\u2013not only for natural language processing but also\nfor code understanding and generation. To stimulate open and responsible\nresearch on LLMs for code, we introduce The Stack, a 3.1 TB dataset\nconsisting of permissively licensed source code in 30 programming languages.\nWe describe how we collect the full dataset, construct a permissively licensed\nsubset, and present promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that\n(1) near-deduplicating the data significantly boosts performance across all\nexperiments, and (2) it is possible to match previously reported HumanEval\nand MBPP performance using only permissively licensed data. We make the\ndataset available at https://hf.co/BigCode and give developers the possi-\nbility to have their code removed from the dataset by following the instruc-\ntions at https://www.bigcode-project.org/docs/about/the-stack/.

\n", "tags": ["dataset"], "tsne_embedding": [-0.10706917941570282, 4.5177388191223145]}, {"key": "korbak2021energy", "year": "2021", "title": "Energy-Based Models for Code Generation under Compilability Constraints", "abstract": "

Neural language models can be successfully trained on source code, leading to applications such as code completion. However, their versatile autoregressive self-supervision objective overlooks important global sequence-level features that are present in the data such as syntactic correctness or compilability. In this work, we pose the problem of learning to generate compilable code as constraint satisfaction. We define an Energy-Based Model (EBM) representing a pre-trained generative model with an imposed constraint of generating only compilable sequences. We then use the KL-Adaptive Distributional Policy Gradient algorithm (Khalifa et al., 2021) to train a generative model approximating the EBM. We conduct experiments showing that our proposed approach is able to improve compilability rates without sacrificing diversity and complexity of the generated samples.

\n", "tags": ["code generation"], "tsne_embedding": [-21.631393432617188, -0.5758680105209351]}, {"key": "kovalchuk2022human", "year": "2022", "title": "Human perceiving behavior modeling in evaluation of code generation models", "abstract": "

Within this study, we evaluated a series of code generation models based on CodeGen and GPTNeo to compare the metric-based performance and human evaluation. For a deeper analysis of human perceiving within the evaluation procedure we\u2019ve implemented a 5-level Likert scale assessment of the model output using a perceiving model based on the Theory of Planned Behavior (TPB). Through such analysis, we showed an extension of model assessment as well as a deeper understanding of the quality and applicability of generated code for practical question answering. The approach was evaluated with several model settings in order to assess diversity in quality and style of answer. With the TPB-based model, we showed a different level of perceiving the model result, namely personal understanding, agreement level, and readiness to use the particular code. With such analysis, we investigate a series of issues in code generation as natural language generation (NLG) problems observed in a practical context of programming question-answering with code.

\n", "tags": ["code generation", "evaluation", "human evaluation"], "tsne_embedding": [5.77072811126709, -2.769697904586792]}, {"key": "kovalchuk2023test", "year": "2023", "title": "Test-based and metric-based evaluation of code generation models for practical question answering", "abstract": "

We performed a comparative analysis of code generation model performance with evaluation using common NLP metrics in comparison to a test-based evaluation. The investigation was performed in the context of question answering with code (test-to-code problem) and was aimed at applicability checking both ways for generated code evaluation in a fully automatic manner. We used CodeGen and GPTNeo pretrained models applied to a problem of question answering using Stack Overflow-based corpus (APIzation). For test-based evaluation, industrial test-generation solutions (Machinet, UTBot) were used for providing automatically generated tests. The analysis showed that the performance evaluation based solely on NLP metrics or on tests provides a rather limited assessment of generated code quality. We see the evidence that predictions with both high and low NLP metrics exist that pass and don\u2019t pass tests. With the early results of our empirical study being discussed in this paper, we believe that the combination of both approaches may increase possible ways for building, evaluating, and training code generation models.

\n", "tags": ["code generation", "test generation", "natural language generation", "evaluation", "metrics", "natural language processing"], "tsne_embedding": [5.394495487213135, -2.669963836669922]}, {"key": "kovalenko2019pathminer", "year": "2019", "title": "PathMiner : A Library for Mining of Path-Based Representations of Code", "abstract": "

One recent, significant advance in modeling source code for machine learning algorithms has been the introduction of path-based representation \u2013 an approach consisting in representing a snippet of code as a collection of paths from its syntax tree. Such representation efficiently captures the structure of code, which, in turn, carries its semantics and other information.\nBuilding the path-based representation involves parsing the code and extracting the paths from its syntax tree; these steps build up to a substantial technical job. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from the essential work and hinders newcomers in the field of machine learning on code.

\n\n

In this paper, we present PathMiner \u2013 an open-source library for mining path-based representations of code. PathMiner is fast, flexible, well-tested, and easily extensible to support input code in any common programming language. Preprint [https://doi.org/10.5281/zenodo.2595271]; released tool [https://doi.org/10.5281/zenodo.2595257].

\n", "tags": ["representation", "grammar"], "tsne_embedding": [8.773146629333496, -14.394119262695312]}, {"key": "kremenek2007factor", "year": "2007", "title": "A Factor Graph Model for Software Bug Finding", "abstract": "

Automatic tools for finding software errors require\nknowledge of the rules a program must obey, or\n\u201cspecifications,\u201d before they can identify bugs. We\npresent a method that combines factor graphs and\nstatic program analysis to automatically infer specifications directly from programs. We illustrate the\napproach on inferring functions in C programs that\nallocate and release resources, and evaluate the approach on three codebases: SDL, OpenSSH, and\nthe OS kernel for Mac OS X (XNU). The inferred\nspecifications are highly accurate and with them we\nhave discovered numerous bugs.

\n\n", "tags": ["program analysis"], "tsne_embedding": [20.24456024169922, 9.622844696044922]}, {"key": "kulal2019spoc", "year": "2019", "title": "SPoC: Search-based Pseudocode to Code", "abstract": "

We consider the task of mapping pseudocode to long programs that are functionally correct. Given test cases as a mechanism to validate programs, we search over the space of possible translations of the pseudocode to find a program that passes the validation. However, without proper credit assignment to localize the sources of program failures, it is difficult to guide search toward more promising programs. We propose to perform credit assignment based on signals from compilation errors, which constitute 88.7% of program failures. Concretely, we treat the translation of each pseudocode line as a discrete portion of the program, and whenever a synthesized program fails to compile, an error localization method tries to identify the portion of the program responsible for the failure. We then focus search over alternative translations of the pseudocode for those portions. For evaluation, we collected the SPoC dataset (Search-based Pseudocode to Code) containing 18,356 programs with human-authored pseudocode and test cases. Under a budget of 100 program compilations, performing search improves the synthesis success rate over using the top-one translation of the pseudocode from 25.6% to 44.7%.

\n", "tags": ["bimodal", "synthesis"], "tsne_embedding": [12.680595397949219, -3.3031506538391113]}, {"key": "kurbatova2020recommendation", "year": "2020", "title": "Recommendation of Move Method Refactoring Using Path-Based Representation of Code", "abstract": "

Software refactoring plays an important role in increasing code quality. One of the most popular refactoring types is the Move Method refactoring. It is usually applied when a method depends more on members of other classes than on its own original class. Several approaches have been proposed to recommend Move Method refactoring automatically. Most of them are based on heuristics and have certain limitations (e.g., they depend on the selection of metrics and manually-defined thresholds). In this paper, we propose an approach to recommend Move Method refactoring based on a path-based representation of code called code2vec that is able to capture the syntactic structure and semantic information of a code fragment. We use this code representation to train a machine learning classifier suggesting to move methods to more appropriate classes. We evaluate the approach on two publicly available datasets: a manually compiled dataset of well-known open-source projects and a synthetic dataset with automatically injected code smell instances. The results show that our approach is capable of recommending accurate refactoring opportunities and outperforms JDeodorant and JMove, which are state of the art tools in this field.

\n", "tags": ["refactoring"], "tsne_embedding": [14.888970375061035, -11.777994155883789]}, {"key": "kushman2013using", "year": "2013", "title": "Using Semantic Unification to Generate Regular Expressions from Natural Language", "abstract": "

We consider the problem of translating natural language text queries into regular expressions which represent their meaning. The mismatch in the level of abstraction between the natural language representation and the regular expression representation make this a novel and challenging problem. However, a given regular expression can be written in many semantically equivalent forms, and we exploit this flexibility to facilitate translation by finding a form which more directly corresponds to the natural language. We evaluate our technique on a set of natural language queries and their associated regular expressions which we gathered from Amazon Mechanical Turk. Our model substantially outperforms a state-of-the-art semantic parsing baseline, yielding a 29% absolute improvement in accuracy.

\n", "tags": ["bimodal", "code generation"], "tsne_embedding": [-18.98016357421875, -19.69573211669922]}, {"key": "lachaux2020unsupervised", "year": "2020", "title": "Unsupervised Translation of Programming Languages", "abstract": "

A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to the source code abstract syntax tree. Unfortunately, the resulting translations often lack readability, fail to respect the target language conventions, and require manual modifications in order to work properly. The overall translation process is timeconsuming and requires expertise in both the source and target languages, making code-translation projects expensive. Although neural models significantly outperform their rule-based counterparts in the context of natural language translation, their applications to transcompilation have been limited due to the scarcity of parallel data in this domain. In this paper, we propose to leverage recent approaches in unsupervised machine translation to train a fully unsupervised neural transcompiler. We train our model on source code from open source GitHub projects, and show that it can translate functions between C++, Java, and Python with high accuracy. Our method relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages. We also build and release a test set composed of 852 parallel functions, along with unit tests to check the correctness of translations. We show that our model outperforms rule-based commercial baselines by a significant margin.

\n", "tags": ["migration"], "tsne_embedding": [3.632676839828491, 5.555427551269531]}, {"key": "lacomis2019neural", "year": "2019", "title": "A Neural Approach to Decompiled Identifier Renaming", "abstract": "

The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. However, compilation loses information contained within the original source code (e.g. structure, type information, and variable names). Semantically meaningful variable names are known to increase code understandability, but they generally cannot be recovered by decompilers. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3% of the time.

\n", "tags": ["deobfuscation", "naming", "compilation"], "tsne_embedding": [14.803936004638672, 17.555686950683594]}, {"key": "lanchantin2018exploring", "year": "2018", "title": "Exploring the Naturalness of Buggy Code with Recurrent Neural Network", "abstract": "

Statistical language models are powerful tools\nwhich have been used for many tasks within natural language processing. Recently, they have been\nused for other sequential data such as source code.\n(Ray et al., 2015) showed that it is possible train an\nn-gram\nsource code language mode, and use it to\npredict buggy lines in code by determining \u201cunnatural\u201d lines via entropy with respect to the language\nmodel. In this work, we propose using a more advanced language modeling technique, Long Short-term Memory recurrent neural networks, to model\nsource code and classify buggy lines based on entropy. We show that our method slightly outperforms an\nn-gram model in the buggy line classification task using AUC

\n", "tags": ["language model", "defect"], "tsne_embedding": [-12.701126098632812, 7.303003787994385]}, {"key": "leclair2019neural", "year": "2019", "title": "A Neural Model for Generating Natural Language Summaries of Program Subroutines", "abstract": "

Source code summarization \u2013 creating natural language descriptions of source code behavior \u2013 is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance. Traditional techniques relied on heuristics and templates built manually by human experts. Recently, data-driven approaches based on neural machine translation have largely overtaken template-based systems. But nearly all of these techniques rely almost entirely on programs having good internal documentation; without clear identifier names, the models fail to create good summaries. In this paper, we present a neural model that combines words from code with code structure from an AST. Unlike previous approaches, our model processes each data source as a separate input, which allows the model to learn code structure independent of the text in code. This process helps our approach provide coherent summaries in many cases even when zero internal documentation is provided. We evaluate our technique with a dataset we created from 2.1m Java methods. We find improvement over two baseline techniques from SE literature and one from NLP literature.

\n", "tags": ["summarization", "documentation"], "tsne_embedding": [-15.732378005981445, -7.262314319610596]}, {"key": "leclair2019recommendations", "year": "2019", "title": "Recommendations for Datasets for Source Code Summarization", "abstract": "

Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained due to a lack of suitable datasets. In addition, a lack of community standards for creating datasets leads to confusing and unreproducible research results \u2013 we observe swings in performance of more than 33% due only to changes in dataset design. In this paper, we make recommendations for these standards from experimental results. We release a dataset based on prior work of over 2.1m pairs of Java methods and one sentence method descriptions from over 28k Java projects. We describe the dataset and point out key differences from natural language data, to guide and support future researchers.

\n", "tags": ["summarization", "dataset"], "tsne_embedding": [-16.4310302734375, -11.204122543334961]}, {"key": "leclair2020improved", "year": "2020", "title": "Improved Code Summarization via a Graph Neural Network", "abstract": "

Automatic source code summarization is the task of generating natural language descriptions for source code. Automatic code summarization is a rapidly expanding research area, especially as the community has taken greater advantage of advances in neural network and AI technologies. In general, source code summarization techniques use the source code as input and outputs a natural language description. Yet a strong consensus is developing that using structural information as input leads to improved performance. The first approaches to use structural information flattened the AST into a sequence. Recently, more complex approaches based on random AST paths or graph neural networks have improved on the models using flattened ASTs. However, the literature still does not describe the using a graph neural network together with source code sequence as separate inputs to a model. Therefore, in this paper, we present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries. We evaluate our technique using a data set of 2.1 million Java method-comment pairs and show improvement over four baseline techniques, two from the software engineering literature, and two from machine learning literature.

\n", "tags": ["summarization"], "tsne_embedding": [-16.44574737548828, -7.117654800415039]}, {"key": "lee2020montage", "year": "2020", "title": "Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer", "abstract": "

JavaScript (JS) engine vulnerabilities pose significant security threats affecting billions of web browsers. While fuzzing is a prevalent technique for finding such vulnerabilities, there have been few studies that leverage the recent advances in neural network language models (NNLMs). In this paper, we present Montage, the first NNLM-guided fuzzer for finding JS engine vulnerabilities. The key aspect of our technique is to transform a JS abstract syntax tree (AST) into a sequence of AST subtrees that can directly train prevailing NNLMs. We demonstrate that Montage is capable of generating valid JS tests, and show that it outperforms previous studies in terms of finding vulnerabilities. Montage found 37 real-world bugs, including three CVEs, in the latest JS engines, demonstrating its efficacy in finding JS engine bugs.

\n", "tags": ["fuzzing", "language model"], "tsne_embedding": [18.844755172729492, 15.456668853759766]}, {"key": "lee2021cotraining", "year": "2021", "title": "Co-Training for Commit Classification", "abstract": "

Commits in version control systems (e.g. Git) track changes in a software project. Commits comprise noisy user-generated natural language and code patches. Automatic commit classification (CC) has been used to determine the type of code maintenance activities performed, as well as to detect bug fixes in code repositories. Much prior work occurs in the fully-supervised setting \u2013 a setting that can be a stretch in resource-scarce situations presenting difficulties in labeling commits. In this paper, we apply co-training, a semi-supervised learning method, to take advantage of the two views available \u2013 the commit message (natural language) and the code changes (programming language) \u2013 to improve commit classification.

\n", "tags": ["Transformer", "bimodal", "defect"], "tsne_embedding": [-15.535037994384766, 4.986502170562744]}, {"key": "levy2017learning", "year": "2017", "title": "Learning to Align the Source Code to the Compiled Object Code", "abstract": "

We propose a new neural network architecture\nand use it for the task of statement-by-statement\nalignment of source code and its compiled object code. Our architecture learns the alignment\nbetween the two sequences \u2013 one being the translation of the other \u2013 by mapping each statement\nto a context-dependent representation vector and\naligning such vectors using a grid of the two sequence domains. Our experiments include short\nC functions, both artificial and human-written,\nand show that our neural network architecture\nis able to predict the alignment with high accuracy, outperforming known baselines. We also\ndemonstrate that our model is general and can\nlearn to solve graph problems such as the Traveling Salesman Problem.

\n", "tags": ["decompilation"], "tsne_embedding": [-9.241816520690918, 12.116851806640625]}, {"key": "lherondelle2022topical", "year": "2022", "title": "Topical: Learning Repository Embeddings from Source Code using Attention", "abstract": "

Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationship between software artefacts, MLOnCode\naugments the software developer\u2019s capabilities with code autogeneration, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script level\nrepresentation of code is sufficient, however, in many cases a repository level representation that takes into account various dependencies and repository structure is imperative, for example,\nauto-tagging repositories with topics or auto-documentation of repository code etc. Existing methods for computing repository level representations suffer from (a) reliance on natural language\ndocumentation of code (for example, README files) (b) naive aggregation of method/script-level representation, for example, by concatenation or averaging. This paper introduces Topical a\ndeep neural network to generate repository level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph and the\nscript level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that\nwere crawled along with their ground truth topic tags. Our experiments show that the embeddings computed by Topical are able to outperform multiple baselines, including baselines\nthat naively combine the method-level representations through averaging or concatenation at the task of repository auto-tagging. Furthermore, we show that Topical\u2019s attention mechanism outperforms naive aggregation methods when computing repositorylevel representations from script-level representation generated\nby existing methods. Topical is a lightweight framework for computing repository-level representation of code repositories that scales efficiently with the number of topics and dataset size.

\n", "tags": ["representation", "topic modelling"], "tsne_embedding": [-1.6275664567947388, -10.85648250579834]}, {"key": "li2016gated", "year": "2016", "title": "Gated Graph Sequence Neural Networks", "abstract": "

Graph-structured data appears frequently in domains including chemistry, natural\nlanguage semantics, social networks, and knowledge bases. In this work, we study\nfeature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify\nto use gated recurrent units and modern optimization techniques and then extend\nto output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based\nmodels (e.g., LSTMs) when the problem is graph-structured. We demonstrate the\ncapabilities on some simple AI (bAbI) and graph algorithm learning tasks. We\nthen show it achieves state-of-the-art performance on a problem from program\nverification, in which subgraphs need to be described as abstract data structures.

\n\n", "tags": ["GNN", "program analysis"], "tsne_embedding": [-1.6541056632995605, 15.015466690063477]}, {"key": "li2017code", "year": "2017", "title": "Code Completion with Neural Attention and Pointer Networks", "abstract": "

Intelligent code completion has become an essential tool to accelerate modern software development. To facilitate effective code completion for dynamically-typed programming languages, we apply neural language models by learning from large codebases, and investigate the effectiveness of attention mechanism on the code completion task. However, standard neural language models even with attention mechanism cannot correctly predict out-of-vocabulary (OoV) words thus restrict the code completion performance. In this paper, inspired by the prevalence of locally repeated terms in program source code, and the recently proposed pointer networks which can reproduce words from local context, we propose a pointer mixture network for better predicting OoV words in code completion. Based on the context, the pointer mixture network learns to either generate a within-vocabulary word through an RNN component, or copy an OoV word from local context through a pointer component. Experiments on two benchmarked datasets demonstrate the effectiveness of our attention mechanism and pointer mixture network on the code completion task.

\n\n", "tags": ["language model", "autocomplete"], "tsne_embedding": [-7.771483898162842, 5.322144031524658]}, {"key": "li2017software", "year": "2017", "title": "Software Defect Prediction via Convolutional Neural Network", "abstract": "

To improve software reliability, software defect prediction is utilized to assist developers in finding potential bugs\nand allocating their testing efforts. Traditional defect prediction\nstudies mainly focus on designing hand-crafted features, which\nare input into machine learning classifiers to identify defective\ncode. However, these hand-crafted features often fail to capture\nthe semantic and structural information of programs. Such\ninformation is important in modeling program functionality and\ncan lead to more accurate defect prediction.\nIn this paper, we propose a framework called Defect Prediction\nvia Convolutional Neural Network (DP-CNN), which leverages\ndeep learning for effective feature generation. Specifically, based\non the programs\u2019 Abstract Syntax Trees (ASTs), we first extract\ntoken vectors, which are then encoded as numerical vectors\nvia mapping and word embedding. We feed the numerical\nvectors into Convolutional Neural Network to automatically\nlearn semantic and structural features of programs. After that,\nwe combine the learned features with traditional hand-crafted\nfeatures, for accurate software defect prediction. We evaluate our\nmethod on seven open source projects in terms of F-measure in\ndefect prediction. The experimental results show that in average,\nDP-CNN improves the state-of-the-art method by 12%.

\n\n", "tags": ["defect"], "tsne_embedding": [13.863176345825195, 3.081108331680298]}, {"key": "li2019improving", "year": "2019", "title": "Improving Bug Detection via Context-Based Code Representation Learning and Attention-Based Neural Networks", "abstract": "

Bug detection has been shown to be an effective way to help developers in detecting bugs early, thus, saving much effort and time in software development process. Recently, deep learning-based bug detection approaches have gained successes over the traditional machine learning-based approaches, the rule-based program analysis approaches, and mining-based approaches. However, they are still limited in detecting bugs that involve multiple methods and suffer high rate of false positives. In this paper, we propose a combination approach with the use of contexts and attention neural network to overcome those limitations. We propose to use as the global context the Program Dependence Graph (PDG) and Data Flow Graph (DFG) to connect the method under investigation with the other relevant methods that might contribute to the buggy code. The global context is complemented by the local context extracted from the path on the AST built from the method\u2019s body. The use of PDG and DFG enables our model to reduce the false positive rate, while to complement for the potential reduction in recall, we make use of the attention neural network mechanism to put more weights on the buggy paths in the source code. That is, the paths that are similar to the buggy paths will be ranked higher, thus, improving the recall of our model. We have conducted several experiments to evaluate our approach on a very large dataset with +4.973M methods in 92 different project versions. The results show that our tool can have a relative improvement up to 160% on F-score when comparing with the state-of-the-art bug detection approaches. Our tool can detect 48 true bugs in the list of top 100 reported bugs, which is 24 more true bugs when comparing with the baseline approaches. We also reported that our representation is better suitable for bug detection and relatively improves over the other representations up to 206% in accuracy.

\n", "tags": ["representation", "defect"], "tsne_embedding": [16.78055191040039, 4.197618007659912]}, {"key": "li2019neural", "year": "2019", "title": "Neural Code Search Evaluation Dataset", "abstract": "

There has been an increase of interest in code search using natural language. Assessing the performance of such code search models can be difficult without a readily available evaluation suite. In this paper, we present an evaluation dataset consisting of natural language query and code snippet pairs, with the hope that future work in this area can use this dataset as a common benchmark. We also provide the results of two code search models ([1] and [6]) from recent work.

\n", "tags": ["dataset", "search"], "tsne_embedding": [-3.528033494949341, -14.519452095031738]}, {"key": "li2019using", "year": "2019", "title": "Using GGNN to recommend log statement level", "abstract": "

In software engineering, log statement is an important part because programmers can\u2019t access to users\u2019 program and they can only rely on log message to find the root of bugs. The mechanism of \u201clog level\u201d allows developers and users to specify the appropriate amount of logs to print during the execution of the software. And 26\\% of the log statement modification is to modify the level. We tried to use ML method to predict the suitable level of log statement. The specific model is GGNN(gated graph neural network) and we have drawn lessons from Microsoft\u2019s research. In this work, we apply Graph Neural Networks to predict the usage of log statement level of some open source java projects from github. Given the good performance of GGNN in this task, we are confident that GGNN is an excellent choice for processing source code. We envision this model can play an important role in applying AI/ML technique for Software Development Life Cycle more broadly.

\n", "tags": ["GNN", "logging"], "tsne_embedding": [-4.730812072753906, 11.533068656921387]}, {"key": "li2020dlfix", "year": "2020", "title": "DLFix: Context-based Code Transformation Learning for Automated Program Repair", "abstract": "

Automated Program Repair (APR) is very useful in helping developers in the process of software development and maintenance. Despite recent advances in deep learning (DL), the DL-based APR approaches still have limitations in learning bug-fixing code changes and the context of the surrounding source code of the bug-fixing code changes. These limitations lead to incorrect fixing locations or fixes. In this paper, we introduce DLFix, a two-tier DL model that treats APR as code transformation learning from the prior bug fixes and the surrounding code contexts of the fixes. The first layer is a tree-based RNN model that learns the contexts of bug fixes and its result is used as an additional weighting input for the second layer designed to learn the bug-fixing code transformations.

\n\n

We conducted several experiments to evaluate DLFix in two benchmarks: Defect4J and Bugs.jar, and a newly built bug datasets with a total of +20K real-world bugs in eight projects. We compared DLFix against a total of 13 state-of-the-art pattern-based APR tools. Our results show that DLFix can auto-fix more bugs than 11 of them, and is comparable and complementary to the top two pattern-based APR tools in which there are 7 and 11 unique bugs that they cannot detect, respectively, but we can. Importantly, DLFix is fully automated and data-driven, and does not require hard-coding of bug-fixing patterns as in those tools. We compared DLFix against 4 state-of-the-art deep learning based APR models. DLFix is able to fix 2.5 times more bugs than the best performing~baseline.

\n", "tags": ["edit", "repair", "grammar"], "tsne_embedding": [19.193174362182617, 2.095975637435913]}, {"key": "li2020learning", "year": "2020", "title": "Learning Code-Query Interaction for Enhancing Code Searches", "abstract": "

Code search plays an important role in software development and maintenance. In recent years, deep learning (DL) has achieved a great success in this domain-several DL-based code search methods, such as DeepCS and UNIF, have been proposed for exploring deep, semantic correlations between code and queries; each method usually embeds source code and natural language queries into real vectors followed by computing their vector distances representing their semantic correlations. Meanwhile, deep learning-based code search still suffers from three main problems, i.e., the OOV (Out of Vocabulary) problem, the independent similarity matching problem, and the small training dataset problem. To tackle the above problems, we propose CQIL, a novel, deep learning-based code search method. CQIL learns code-query interactions and uses a CNN (Convolutional Neural Network) to compute semantic correlations between queries and code snippets. In particular, CQIL employs a hybrid representation to model code-query correlations, which solves the OOV problem. CQIL also deeply learns the code-query interaction for enhancing code searches, which solves the independent similarity matching and the small training dataset problems. We evaluate CQIL on two datasets (CODEnn and CosBench). The evaluation results show the strengths of CQIL-it achieves the MAP@1 values, 0.694 and 0.574, on CODEnn and CosBench, respectively. In particular, it outperforms DeepCS and UNIF, two state-of-the-art code search methods, by 13.6% and 18.1% in MRR, respectively, when the training dataset is insufficient.

\n", "tags": ["search"], "tsne_embedding": [-1.0473586320877075, -15.357560157775879]}, {"key": "li2021learning", "year": "2021", "title": "Learning to Extend Program Graphs to Work-in-Progress Code", "abstract": "

Source code spends most of its time in a broken or incomplete state during software development. This presents a challenge to machine learning for code, since high-performing models typically rely on graph structured representations of programs derived from traditional program analyses. Such analyses may be undefined for broken or incomplete code. We extend the notion of program graphs to work-in-progress code by learning to predict edge relations between tokens, training on well-formed code before transferring to work-in-progress code. We consider the tasks of code completion and localizing and repairing variable misuse in a work-in-process scenario. We demonstrate that training relation-aware models with fine-tuned edges consistently leads to improved performance on both tasks.

\n", "tags": ["Transformer", "autocomplete", "repair"], "tsne_embedding": [-1.3375908136367798, 10.895987510681152]}, {"key": "li2021toward", "year": "2021", "title": "Toward Less Hidden Cost of Code Completion with Acceptance and Ranking Models", "abstract": "

Code completion is widely used by software developers to provide coding suggestions given a partially written code snippet. Apart from the traditional code completion methods, which only support single token completion at minimal positions, recent studies show the ability to provide longer code completion at more flexible positions. However, such frequently triggered and longer completion results reduce the overall precision as they generate more invalid results. Moreover, different studies are mostly incompatible with each other. Thus, it is vital to develop an ensemble framework that can combine results from multiple models to draw merits and offset defects of each model.\nThis paper conducts a coding simulation to collect data from code context and different code completion models and then apply the data in two tasks. First, we introduce an acceptance model which can dynamically control whether to display completion results to the developer. It uses simulation features to predict whether correct results exist in the output of these models. Our best model reduces the percentage of false-positive completion from 55.09% to 17.44%. Second, we design a fusion ranking scheme that can automatically identify the priority of the completion results and reorder the candidates from multiple code completion models. This scheme is flexible in dealing with various models, regardless of the type or the length of their completion results. We integrate this ranking scheme with two frequency models and a GPT-2 styled language model, along with the acceptance model to yield 27.80% and 37.64% increase in TOP1 and TOP5 accuracy, respectively. In addition, we propose a new code completion evaluation metric, Benefit-Cost Ratio(BCR), taking into account the benefit of keystrokes saving and hidden cost of completion list browsing, which is closer to real coder experience scenario.

\n", "tags": ["autocomplete", "language model", "optimization", "Transformer"], "tsne_embedding": [-9.692549705505371, -15.118404388427734]}, {"key": "li2022codereviewer", "year": "2022", "title": "CodeReviewer: Pre-Training for Automating Code Review Activities", "abstract": "

Code review is an essential part to software development lifecycle since it aims at guaranteeing the quality of codes. Modern code review activities necessitate developers viewing, understanding and even running the programs to assess logic, functionality, latency, style and other factors. It turns out that developers have to spend far too much time reviewing the code of their peers. Accordingly, it is in significant demand to automate the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a large-scale dataset of real world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review senario. To evaluate our model, we focus on three key tasks related to code review activities, including code change quality estimation, review comment generation and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms the previous state-of-the-art pre-training approaches in all tasks. Further analysis show that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model on the understanding of code changes and reviews.

\n", "tags": ["review"], "tsne_embedding": [-8.166886329650879, 1.6699076890945435]}, {"key": "li2022exploring", "year": "2022", "title": "Exploring Representation-Level Augmentation for Code Search", "abstract": "

Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning is widely used in code search research, where many data augmentation approaches for source code (e.g., semantic-preserving program transformation) are proposed to learn better representations. However, these augmentations are at the raw-data level, which requires additional code analysis in the preprocessing stage and additional training costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at representation level which does not require additional data processing and training, and based on this we propose a general format of representation-level augmentation that unifies existing methods. Then, we propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods on code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. The experimental results show that our approach can consistently boost the performance of the studied code search models.

\n", "tags": ["search", "Transformer"], "tsne_embedding": [-4.31868314743042, -16.66735076904297]}, {"key": "li2023hitchhiker", "year": "2023", "title": "The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models", "abstract": "

Static analysis is a widely used technique in software engineering for identifying and mitigating bugs. However, a significant hurdle lies in achieving a delicate balance between precision and scalability. Large Language Models (LLMs) offer a promising alternative, as recent advances demonstrate remarkable capabilities in comprehending, generating, and even debugging code. Yet, the logic of bugs can be complex and require sophisticated reasoning and a large analysis scope spanning multiple functions. Therefore, at this point, LLMs are better used in an assistive role to complement static analysis. In this paper, we take a deep dive into the open space of LLM-assisted static analysis, using use-before-initialization (UBI) bugs as a case study. To this end, we develop LLift, a fully automated agent that interfaces with both a static analysis tool and an LLM. By carefully designing the agent and the prompts, we are able to overcome a number of challenges, including bug-specific modeling, the large problem scope, the non-deterministic nature of LLMs, etc. Tested in a real-world scenario analyzing nearly a thousand potential UBI bugs produced by static analysis, LLift demonstrates an extremely potent capability, showcasing a high precision (50%) and recall rate (100%). It even identified 13 previously unknown UBI bugs in the Linux kernel. This research paves the way for new opportunities and methodologies in the use of LLMs for bug discovery in extensive, real-world datasets.

\n", "tags": ["static analysis"], "tsne_embedding": [18.964157104492188, 9.538141250610352]}, {"key": "li2023rethinking", "year": "2023", "title": "Rethinking Negative Pairs in Code Search", "abstract": "

Recently, contrastive learning has become a key component in fine-tuning code search models for software development efficiency and effectiveness. It pulls together positive code snippets while pushing negative samples away given search queries. Among contrastive learning, InfoNCE is the most widely used loss function due to its better performance. However, the following problems in negative samples of InfoNCE may deteriorate its representation learning: 1) The existence of false negative samples in large code corpora due to duplications. 2). The failure to explicitly differentiate between the potential relevance of negative samples. As an example, a bubble sorting algorithm example is less ``negative\u2019\u2019 than a file saving function for the quick sorting algorithm query. In this paper, we tackle the above problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss function, we apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation. We furthermore discuss the superiority of proposed loss functions with other design alternatives. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and weights estimation methods under state-of-the-art code search models on a large-scale public dataset consisting of six programming languages.

\n", "tags": ["search", "Transformer", "retrieval", "optimization", "representation"], "tsne_embedding": [-4.203897953033447, -18.81262969970703]}, {"key": "li2023starcoder", "year": "2023", "title": "StarCoder: may the source be with you!", "abstract": "

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.

\n", "tags": ["Transformer"], "tsne_embedding": [1.2185238599777222, 3.8435590267181396]}, {"key": "li2023think", "year": "2023", "title": "Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation", "abstract": "

Code generation aims to automatically generate source code from high-level task specifications, which can significantly increase productivity of software engineering. Recently, approaches based on large language models (LLMs) have shown remarkable code generation abilities on simple tasks. However, generate code for more complex tasks, such as competition-level problems, remains challenging. In this paper, we introduce Brainstorm framework for code generation. It leverages a brainstorming step that generates and selects diverse thoughts on the problem to facilitate algorithmic reasoning, where the thoughts are possible blueprint of solving the problem. We demonstrate that Brainstorm significantly enhances the ability of LLMs to solve competition-level programming problems, resulting in a more than 50% increase in the pass@$k$ metrics for ChatGPT on the CodeContests benchmark, achieving state-of-the-art performance. Furthermore, our experiments conducted on LeetCode contests show that our framework boosts the ability of ChatGPT to a level comparable to that of human programmers.

\n", "tags": ["generation", "Transformer"], "tsne_embedding": [5.53621244430542, -0.9227759838104248]}, {"key": "li2024rewriting", "year": "2024", "title": "Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search", "abstract": "

In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly with the demonstrated code generation capabilities of Large Language Models (LLMs). Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation could potentially be ascribed to the fact that the generated codes, albeit functionally accurate, frequently display a pronounced stylistic deviation from the ground truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances.

\n", "tags": ["search", "large language models", "metrics"], "tsne_embedding": [-6.422396659851074, -16.45843505859375]}, {"key": "liguori2021shellcode_ia32", "year": "2021", "title": "Shellcode_IA32: A Dataset for Automatic Shellcode Generation", "abstract": "

We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability, starting from natural language comments. We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions. We experiment with standard methods in neural machine translation (NMT) to establish baseline performance levels on this task.

\n", "tags": ["code generation", "dataset"], "tsne_embedding": [10.37258529663086, 3.098116159439087]}, {"key": "lin2017program", "year": "2017", "title": "Program Synthesis from Natural Language Using Recurrent Neural Networks", "abstract": "

Oftentimes, a programmer may have difficulty implementing a\ndesired operation. Even when the programmer can describe her\ngoal in English, it can be difficult to translate into code. Existing\nresources, such as question-and-answer websites, tabulate specific\noperations that someone has wanted to perform in the past, but\nthey are not effective in generalizing to new tasks, to compound\ntasks that require combining previous questions, or sometimes even\nto variations of listed tasks.

\n\n

Our goal is to make programming easier and more productive by\nletting programmers use their own words and concepts to express\nthe intended operation, rather than forcing them to accommodate\nthe machine by memorizing its grammar. We have built a system\nthat lets a programmer describe a desired operation in natural language, then automatically translates it to a programming language\nfor review and approval by the programmer. Our system, Tellina,\ndoes the translation using recurrent neural networks (RNNs), a\nstate-of-the-art natural language processing technique that we augmented with slot (argument) filling and other enhancements.

\n\n

We evaluated Tellina in the context of shell scripting. We trained\nTellina\u2019s RNNs on textual descriptions of file system operations\nand bash one-liners, scraped from the web. Although recovering\ncompletely correct commands is challenging, Tellina achieves top-3\naccuracy of 80% for producing the correct command structure. In a\ncontrolled study, programmers who had access to Tellina outperformed those who did not, even when Tellina\u2019s predictions were\nnot completely correct, to a statistically significant degree.

\n", "tags": ["bimodal", "code generation"], "tsne_embedding": [8.345367431640625, 3.1196534633636475]}, {"key": "lin2018nl2bash", "year": "2018", "title": "NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System", "abstract": "

We present new data and semantic parsing methods for the problem of mapping english sentences to Bash commands (NL2Bash). Our long-term goal is to enable any user to easily solve otherwise repetitive tasks (such as file manipulation, search, and application-specific scripting) by simply stating their intents in English. We take a first step in this domain, by providing a large new dataset of challenging but commonly used commands paired with their English descriptions, along with the baseline methods to establish performance levels on this task.

\n", "tags": ["bimodal", "code generation"], "tsne_embedding": [-9.047262191772461, -11.888605117797852]}, {"key": "lin2019impact", "year": "2019", "title": "On the Impact of Refactoring Operations on Code Naturalness", "abstract": "

Recent studies have demonstrated that software is natural, that is, its source code is highly repetitive and predictable like human languages. Also, previous studies suggested the existence of a relationship between code quality and its naturalness, presenting empirical evidence showing that buggy code is \u201cless natural\u201d than non-buggy code. We conjecture that this qualitynaturalness relationship could be exploited to support refactoring activities (e.g., to locate source code areas in need of refactoring). We perform a first step in this direction by analyzing whether refactoring can improve the naturalness of code. We use state-of-the-art tools to mine a large dataset of refactoring operations performed in open source systems. Then, we investigate the impact of different types of refactoring operations on the naturalness of the impacted code. We found that (i) code refactoring does not necessarily increase the naturalness of the refactored code; and (ii) the impact on the code naturalness strongly depends on the type of refactoring operations.

\n", "tags": ["language model", "refactoring"], "tsne_embedding": [15.612140655517578, -11.490396499633789]}, {"key": "ling2016latent", "year": "2016", "title": "Latent Predictor Networks for Code Generation", "abstract": "

Many language generation tasks require\nthe production of text conditioned on both\nstructured and unstructured inputs.\nWe present a novel neural network architecture which generates an output sequence\nconditioned on an arbitrary number of input functions.\nCrucially, our approach\nallows both the choice of conditioning\ncontext and the granularity of generation,\nfor example characters or tokens, to be\nmarginalised, thus permitting scalable and\neffective training. Using this framework,\nwe address the problem of generating programming code from a mixed natural language and structured specification.\nWe create two new data sets for this paradigm\nderived from the collectible trading card\ngames Magic the Gathering and Hearthstone. On these, and a third preexisting\ncorpus, we demonstrate that marginalising multiple predictors allows our model\nto outperform strong benchmarks.

\n\n", "tags": ["bimodal", "code generation"], "tsne_embedding": [-21.200544357299805, -0.36488035321235657]}, {"key": "ling2020adaptive", "year": "2020", "title": "Adaptive Deep Code Search", "abstract": "

Searching code in a large-scale codebase using natural language queries is a common practice during software development. Deep learning-based code search methods demonstrate superior performance if models are trained with large amount of text-code pairs. However, few deep code search models can be easily transferred from one codebase to another. It can be very costly to prepare training data for a new codebase and re-train an appropriate deep learning model. In this paper, we propose AdaCS, an adaptive deep code search method that can be trained once and transferred to new codebases. AdaCS decomposes the learning process into embedding domain-specific words and matching general syntactic patterns. Firstly, an unsupervised word embedding technique is used to construct a matching matrix to represent the lexical similarities. Then, a recurrent neural network is used to capture latent syntactic patterns from these matching matrices in a supervised way. As the supervised task learns general syntactic patterns that exist across domains, AdaCS is transferable to new codebases. Experimental results show that: when extended to new software projects never seen in the training data, AdaCS is more robust and significantly outperforms state-of-the-art deep code search methods.

\n", "tags": ["search"], "tsne_embedding": [-0.09181664884090424, -14.861451148986816]}, {"key": "ling2020deep", "year": "2020", "title": "Deep Graph Matching and Searching for Semantic Code Retrieval", "abstract": "

Code retrieval is to find the code snippet from a large corpus of source code repositories that highly matches the query of natural language description. Recent work mainly uses natural language processing techniques to process both query texts (i.e., human natural language) and code snippets (i.e., machine programming language), however neglecting the deep structured features of query texts and source codes, both of which contain rich semantic information. In this paper, we propose an end-to-end deep graph matching and searching (DGMS) model based on graph neural networks for the task of semantic code retrieval. To this end, we first represent both natural language query texts and programming language code snippets with the unified graph-structured data, and then use the proposed graph matching and searching model to retrieve the best matching code snippet. In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them by cross-attention based semantic matching operations. We evaluate the proposed DGMS model on two public code retrieval datasets with two representative programming languages (i.e., Java and Python). Experiment results demonstrate that DGMS significantly outperforms state-of-the-art baseline models by a large margin on both datasets. Moreover, our extensive ablation studies systematically investigate and illustrate the impact of each part of DGMS.

\n", "tags": ["search", "GNN"], "tsne_embedding": [-1.2258307933807373, -13.830181121826172]}, {"key": "liu2016towards", "year": "2016", "title": "Towards Better Program Obfuscation: Optimization via Language Models", "abstract": "

As a common practice in software development, program\nobfuscation aims at deterring reverse engineering and malicious attacks on released source or binary code. Owning ample obfuscation techniques, we have relatively little\nknowledge on how to most effectively use them. The biggest\nchallenge lies in identifying the most useful combination of\nthese techniques. We propose a unified framework to automatically generate and optimize obfuscation based on an\nobscurity language model and a Monte Carlo Markov Chain\n(MCMC) based search algorithm. We further instantiate it\nfor JavaScript programs and developed the Closure tool.\nCompared to the well-known Google Closure Compiler, Closure outperforms its default setting by 26%. For programs\nwhich have already been well obfuscated, Closure can still\noutperform by 22%.

\n", "tags": ["deobfuscation"], "tsne_embedding": [18.7750244140625, 18.461368560791016]}, {"key": "liu2018neural", "year": "2018", "title": "Neural-Machine-Translation-Based Commit Message Generation: How Far Are We?", "abstract": "

Commit messages can be regarded as the documentation of software changes. These messages describe the content and purposes of changes, hence are useful for program comprehension and software maintenance. However, due to the lack of time and direct motivation, commit messages sometimes are neglected by developers. To address this problem, Jiang et al. proposed an approach (we refer to it as NMT), which leverages a neural machine translation algorithm to automatically generate short commit messages from code. The reported performance of their approach is promising, however, they did not explore why their approach performs well. Thus, in this paper, we first perform an in-depth analysis of their experimental results. We find that (1) Most of the test <pre>diffs</pre> from which NMT can generate high-quality messages are similar to one or more training <pre>diffs</pre> at the token level. (2) About 16% of the commit messages in Jiang et al.\u2019s dataset are noisy due to being automatically generated or due to them describing repetitive trivial changes. (3) The performance of NMT declines by a large amount after removing such noisy commit messages. In addition, NMT is complicated and time-consuming. Inspired by our first finding, we proposed a simpler and faster approach, named NNGen (Nearest Neighbor Generator), to generate concise commit messages using the nearest neighbor algorithm. Our experimental results show that NNGen is over 2,600 times faster than NMT, and outperforms NMT in terms of BLEU (an accuracy measure that is widely used to evaluate machine translation systems) by 21%. Finally, we also discuss some observations for the road ahead for automated commit message generation to inspire other researchers.

\n", "tags": ["edit", "summarization"], "tsne_embedding": [-16.65733528137207, 3.901207685470581]}, {"key": "liu2019deepfuzz", "year": "2019", "title": "DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing", "abstract": "

Compilers are among the most fundamental programming\ntools for building software. However, production compilers\nremain buggy. Fuzz testing is often leveraged with newly-generated,\nor mutated inputs in order to find new bugs or security vulnerabilities.\nIn this paper, we propose a grammar-based fuzzing tool called DeepFuzz. Based on a generative\nSequence-to-Sequence model, DeepFuzz automatically and continuously generates well-formed\nC programs. We use this set of new C programs to fuzz off-the-shelf C compilers, e.g. GCC and Clang/LLVM.\nWe present a detailed case study to analyze the success rate and coverage improvement of the\ngenerated C programs for fuzz testing. We analyze the performance of DeepFuzz with three types of sampling\nmethods as well as three types of generation strategies. Consequently, DeepFuzz \nimproved the testing efficacy in regards to the line, function, and branch coverage. In our preliminary\nstudy, we found and reported 8 bugs of GCC, all of which are actively being addressed by developers.

\n", "tags": ["fuzzing", "code generation"], "tsne_embedding": [17.72905731201172, 12.589375495910645]}, {"key": "liu2019generating", "year": "2019", "title": "Generating commit messages from diffs using pointer-generator network", "abstract": "

The commit messages in source code repositories are valuable but not easy to be generated manually in time for tracking issues, reporting bugs, and understanding codes. Recently published works indicated that the deep neural machine translation approaches have drawn considerable attentions on automatic generation of commit messages. However, they could not deal with out-of-vocabulary (OOV) words, which are essential context-specific identifiers such as class names and method names in code diffs. In this paper, we propose PtrGNCMsg, a novel approach which is based on an improved sequence-to-sequence model with the pointer-generator network to translate code diffs into commit messages. By searching the smallest identifier set with the highest probability, PtrGNCMsg outperforms recent approaches based on neural machine translation, and first enables the prediction of OOV words. The experimental results based on the corpus of diffs and manual commit messages from the top 2,000 Java projects in GitHub show that PtrGNCMsg outperforms the state-of-the-art approach with improved BLEU by 1.02, ROUGE-1 by 4.00 and ROUGE-L by 3.78, respectively.

\n", "tags": ["edit"], "tsne_embedding": [-17.259321212768555, 3.581491470336914]}, {"key": "liu2019learning", "year": "2019", "title": "Learning to Sport and Refactor Inconsistent Method Names", "abstract": "

To ensure code readability and facilitate software maintenance, program methods must be named properly. In particular, method names must be consistent with the corresponding method implementations. Debugging method names remains an important topic in the literature, where various approaches analyze commonalities among method names in a large dataset to detect inconsistent method names and suggest better ones. We note that the state-of-the-art does not analyze the implemented code itself to assess consistency. We thus propose a novel automated approach to debugging method names based on the analysis of consistency between method names and method code. The approach leverages deep feature representation techniques adapted to the nature of each artifact. Experimental results on over 2.1 million Java methods show that we can achieve up to 15 percentage points improvement over the state-of-the-art, establishing a record performance of 67.9% F1-measure in identifying inconsistent method names. We further demonstrate that our approach yields up to 25% accuracy in suggesting full names, while the state-of-the-art lags far behind at 1.1% accuracy. Finally, we report on our success in fixing 66 inconsistent method names in a live study on projects in the wild.

\n", "tags": ["naming"], "tsne_embedding": [12.971735000610352, -7.927429676055908]}, {"key": "liu2019neural", "year": "2019", "title": "Neural query expansion for code search", "abstract": "

Searching repositories of existing source code for code snippets is a key task in software engineering. Over the years, many approaches to this problem have been proposed. One recent tool called NCS, takes in a natural language query and outputs relevant code snippets, often being able to correctly answer Stack Overflow questions. But what happens when the developer doesn\u2019t provide a query with a clear intent? What if shorter queries are used to demonstrate a more vague intent?

\n\n

We find that the performance of NCS regresses with shorter queries. Furthermore, data from developers\u2019 code search history logs shows that shorter queries have a less successful code search session: there are more query reformulations and more time is spent browsing the results. These observations lead us to believe that using NCS alone with short queries may not be productive enough.

\n\n

In this paper, we explore an additional way of using neural networks in code search: the automatic expansion of queries. We present NQE, a neural model that takes in a set of keywords and predicts a set of keywords to expand the query to NCS. NQE learns to predict keywords that co-occur with the query keywords in the underlying corpus, which helps expand the query in a productive way. Our results show that with query expansion, NQE + NCS is able to perform better than using NCS alone.

\n", "tags": ["search"], "tsne_embedding": [-3.099456310272217, -16.22878074645996]}, {"key": "liu2020automating", "year": "2020", "title": "Automating Just-In-Time Comment Updating", "abstract": "

Code comments are valuable for program comprehension and software maintenance, and also require maintenance with code evolution. However, when changing code, developers sometimes neglect updating the related comments, bringing in inconsistent or obsolete comments (aka., bad comments). Such comments are detrimental since they may mislead developers and lead to future bugs. Therefore, it is necessary to fix and avoid bad comments. In this work, we argue that bad comments can be reduced and even avoided by automatically performing comment updates with code changes. We refer to this task as \u201cJust-In-Time (JIT) Comment Updating\u201d and propose an approach named CUP (Comment UPdater) to automate this task. CUP can be used to assist developers in updating comments during code changes and can consequently help avoid the introduction of bad comments. Specifically, CUP leverages a novel neural sequence-to-sequence model to learn comment update patterns from extant code-comment co-changes and can automatically generate a new comment based on its corresponding old comment and code change. Several customized enhancements, such as a special tokenizer and a novel co-attention mechanism, are introduced in CUP by us to handle the characteristics of this task. We build a dataset with over 108K comment-code co-change samples and evaluate CUP on it. The evaluation results show that CUP outperforms an information-retrieval-based and a rule-based baselines by substantial margins, and can reduce developers\u2019 edits required for JIT comment updating. In addition, the comments generated by our approach are identical to those updated by developers in 1612 (16.7%) test samples, 7 times more than the best-performing baseline.

\n", "tags": ["documentation"], "tsne_embedding": [-15.681200981140137, -0.24859996140003204]}, {"key": "liu2022open", "year": "2022", "title": "Open-ended Knowledge Tracing", "abstract": "

In education applications, knowledge tracing refers to the problem of estimating students\u2019 time-varying concept/skill mastery level from their past responses to questions and predicting their future performance. One key limitation of most existing knowledge tracing methods is that they treat student responses to questions as binary-valued, i.e., whether they are correct or incorrect. Response correctness analysis/prediction ignores important information on student knowledge contained in the exact content of the responses, especially for open-ended questions. In this paper, we conduct the first exploration into open-ended knowledge tracing (OKT) by studying the new task of predicting students\u2019 exact open-ended responses to questions. Our work is grounded in the domain of computer science education with programming questions. We develop an initial solution to the OKT problem, a student knowledge-guided code generation approach, that combines program synthesis methods using language models with student knowledge tracing methods. We also conduct a series of quantitative and qualitative experiments on a real-world student code dataset to validate OKT and demonstrate its promise in educational applications.

\n", "tags": ["education", "code generation"], "tsne_embedding": [-13.949161529541016, 18.541017532348633]}, {"key": "liu2023code", "year": "2023", "title": "Code Execution with Pre-trained Language Models", "abstract": "

Code execution is a fundamental aspect of programming language semantics that reflects the exact behavior of the code. However, most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures. In this paper, we investigate how well pre-trained models can understand and perform code execution. We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution, which challenges existing models such as Codex. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension. We evaluate CodeExecutor on code execution and show its promising performance and limitations. We also demonstrate its potential benefits for code intelligence tasks such as zero-shot code-to-code search and text-to-code generation. Our analysis provides insights into the learning and generalization abilities of pre-trained models for code execution.

\n", "tags": ["Transformer", "execution"], "tsne_embedding": [-1.6615852117538452, -3.3373091220855713]}, {"key": "lomshakov2023fine", "year": "2023", "title": "Fine-Tuning Large Language Models for Answering Programming Questions with Code Snippets", "abstract": "

We study the ability of pretrained large language models (LLM) to answer questions from online question answering fora such as Stack Overflow. We consider question-answer pairs where the main part of the answer consists of source code. On two benchmark datasets \u2014 CoNaLa and a newly collected dataset based on Stack Overflow \u2014 we investigate how a closed-book question answering system can be improved by fine-tuning the LLM for the downstream task, prompt engineering, and data preprocessing. We use publicly available autoregressive language models such as GPT-Neo, CodeGen, and PanGu-Coder, and after the proposed fine-tuning achieve a BLEU score of 0.4432 on the CoNaLa test set, significantly exceeding previous state of the art for this task.

\n", "tags": ["program synthesis", "question answering", "large language models"], "tsne_embedding": [4.205728054046631, -2.7051892280578613]}, {"key": "louis2018deep", "year": "2018", "title": "Deep Learning to Detect Redundant Method Comments", "abstract": "

Comments in software are critical for maintenance and reuse. But apart from prescriptive advice, there is little practical support or quantitative understanding of what makes a comment useful. In this paper, we introduce the task of identifying comments which are uninformative about the code they are meant to document. To address this problem, we introduce the notion of comment entailment from code, high entailment indicating that a comment\u2019s natural language semantics can be inferred directly from the code. Although not all entailed comments are low quality, comments that are too easily inferred, for example, comments that restate the code, are widely discouraged by authorities on software style. Based on this, we develop a tool called CRAIC which scores method-level comments for redundancy. Highly redundant comments can then be expanded or alternately removed by the developer. CRAIC uses deep language models to exploit large software corpora without requiring expensive manual annotations of entailment. We show that CRAIC can perform the comment entailment task with good agreement with human judgements. Our findings also have implications for documentation tools. For example, we find that common tags in Javadoc are at least two times more predictable from code than non-Javadoc sentences, suggesting that Javadoc tags are less informative than more free-form comments

\n", "tags": ["bimodal", "documentation"], "tsne_embedding": [-14.615335464477539, -2.3218743801116943]}, {"key": "louis2020where", "year": "2020", "title": "Where should I comment my code? A dataset and model for predicting locations that need comments", "abstract": "

Programmers should write code comments, but not on every line\nof code. We have created a machine learning model that suggests\nlocations where a programmer should write a code comment. We\ntrained it on existing commented code to learn locations that are\nchosen by developers. Once trained, the model can predict locations\nin new code. Our models achieved precision of 74% and recall of\n13% in identifying comment-worthy locations. This first success\nopens the door to future work, both in the new where-to-comment\nproblem and in guiding comment generation.

\n", "tags": ["bimodal", "documentation"], "tsne_embedding": [-16.752710342407227, -2.086212635040283]}, {"key": "loyola2017neural", "year": "2017", "title": "A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes", "abstract": "

We propose a model to automatically describe changes introduced in the source code of a program using natural language. Our method receives as input a set of code commits, which contains both the modifications and message introduced by an user. These two modalities are used to train an encoder-decoder architecture. We evaluated our approach on twelve real world open source projects from four different programming languages. Quantitative and qualitative results showed that the proposed approach can generate feasible and semantically sound descriptions not only in standard in-project settings, but also in a cross-project setting.

\n", "tags": ["edit", "summarization"], "tsne_embedding": [-16.094999313354492, 1.9536467790603638]}, {"key": "loyola2018content", "year": "2018", "title": "Content Aware Source Code Change Description Generation", "abstract": "

We propose to study the generation of descriptions from source code changes by integrating the messages included on code\ncommits and the intra-code documentation\ninside the source in the form of docstrings.\nOur hypothesis is that although both types\nof descriptions are not directly aligned in\nsemantic terms \u2014one explaining a change\nand the other the actual functionality of\nthe code being modified\u2014 there could be\ncertain common ground that is useful for\nthe generation. To this end, we propose\nan architecture that uses the source code-docstring relationship to guide the description generation. We discuss the results of\nthe approach comparing against a baseline\nbased on a sequence-to-sequence model,\nusing standard automatic natural language\ngeneration metrics as well as with a human\nstudy, thus offering a comprehensive view\nof the feasibility of the approach.

\n", "tags": ["edit", "summarization"], "tsne_embedding": [-15.96943473815918, 1.617795467376709]}, {"key": "lu2019program", "year": "2019", "title": "Program Classification Using Gated Graph Attention Neural Network for Online Programming Service", "abstract": "

The online programing services, such as Github, TopCoder, and EduCoder, have promoted a lot of social interactions among the service users. However, the existing social interactions is rather limited and inefficient due to the rapid increasing of source-code repositories, which is difficult to explore manually. The emergence of source-code mining provides a promising way to analyze those source codes, so that those source codes can be relatively easy to understand and share among those service users. Among all the source-code mining attempts,program classification lays a foundation for various tasks related to source-code understanding, because it is impossible for a machine to understand a computer program if it cannot classify the program correctly. Although numerous machine learning models, such as the Natural Language Processing (NLP) based models and the Abstract Syntax Tree (AST) based models, have been proposed to classify computer programs based on their corresponding source codes, the existing works cannot fully characterize the source codes from the perspective of both the syntax and semantic information. To address this problem, we proposed a Graph Neural Network (GNN) based model, which integrates data flow and function call information to the AST,and applies an improved GNN model to the integrated graph, so as to achieve the state-of-art program classification accuracy. The experiment results have shown that the proposed work can classify programs with accuracy over 97%.

\n", "tags": ["GNN", "representation"], "tsne_embedding": [-4.141998291015625, 12.793905258178711]}, {"key": "lu2021codexglue", "year": "2021", "title": "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation", "abstract": "

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.

\n", "tags": ["benchmark", "Transformer"], "tsne_embedding": [0.7703889608383179, 1.1770837306976318]}, {"key": "lu2022reacc", "year": "2022", "title": "ReACC: A Retrieval-Augmented Code Completion Framework", "abstract": "

Code completion, which aims to predict the following code token(s) according to the code context, can improve the productivity of software development. Recent work has proved that statistical language modeling with transformers can greatly improve the performance in the code completion task via learning from large-scale source code datasets. However, current approaches focus only on code context within the file or project, i.e. internal context. Our distinction is utilizing \u201cexternal\u201d context, inspired by human behaviors of copying from the related code snippets when writing code. Specifically, we propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval. We adopt a stage-wise training approach that combines a source code retriever and an auto-regressive language model for programming language. We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.

\n", "tags": ["Transformer", "autocomplete"], "tsne_embedding": [-8.362076759338379, -15.915071487426758]}, {"key": "luan2019aroma", "year": "2015", "title": "Aroma: code recommendation via structural code search", "abstract": "

Programmers often write code that has similarity to existing code written somewhere. A tool that could help programmers to search such similar code would be immensely useful. Such a tool could help programmers to extend partially written code snippets to completely implement necessary functionality, help to discover extensions to the partial code which are commonly included by other programmers, help to cross-check against similar code written by other programmers, or help to add extra code which would fix common mistakes and errors. We propose Aroma, a tool and technique for code recommendation via structural code search. Aroma indexes a huge code corpus including thousands of open-source projects, takes a partial code snippet as input, searches the corpus for method bodies containing the partial code snippet, and clusters and intersects the results of the search to recommend a small set of succinct code snippets which both contain the query snippet and appear as part of several methods in the corpus. We evaluated Aroma on 2000 randomly selected queries created from the corpus, as well as 64 queries derived from code snippets obtained from Stack Overflow, a popular website for discussing code. We implemented Aroma for 4 different languages, and developed an IDE plugin for Aroma. Furthermore, we conducted a study where we asked 12 programmers to complete programming tasks using Aroma, and collected their feedback. Our results indicate that Aroma is capable of retrieving and recommending relevant code snippets efficiently.

\n", "tags": ["search"], "tsne_embedding": [-5.367781639099121, -17.75845718383789]}, {"key": "maddison2014structured", "year": "2014", "title": "Structured Generative Models of Natural Source Code", "abstract": "

We study the problem of building generative\nmodels of natural source code (NSC); that is,\nsource code written by humans and meant to\nbe understood by humans. Our primary con-\ntribution is to describe new generative models\nthat are tailored to NSC. The models are based\non probabilistic context free grammars (PCFGs)\nand neuro-probabilistic language models (Mnih\n& Teh, 2012), which are extended to incorporate\nadditional source code-specific structure. These\nmodels can be efficiently trained on a corpus\nof source code and outperform a variety of less\nstructured baselines in terms of predictive log\nlikelihoods on held-out data.

\n\n", "tags": ["language model", "code generation", "grammar", "grammar"], "tsne_embedding": [-20.122058868408203, -1.869789958000183]}, {"key": "mahmud2021code", "year": "2021", "title": "Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors", "abstract": "

Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to \u201ctranslate\u201d code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an error taxonomy that can be used to drive future research efforts.

\n", "tags": ["survey", "summarization", "Transformer"], "tsne_embedding": [-16.167753219604492, -12.526226043701172]}, {"key": "malik2019nl2type", "year": "2019", "title": "NL2Type: Inferring JavaScript Function Types from Natural Language Information", "abstract": "

JavaScript is dynamically typed and hence lacks thetype safety of statically typed languages,\nleading to suboptimal IDE support, difficult to understand APIs, and unexpected run-time behavior.\nSeveral gradual type systems have been proposed, e.g., Flow and TypeScript, but they rely on developers\nto annotatecode with types. This paper presents NL2Type, a learning-based approach for predicting likely\ntype signatures of JavaScript functions. The key idea is to exploit natural language information in\nsource code, such as comments, function names, and parameternames, a rich source of knowledge\nthat is typically ignored by type inference algorithms. We formulate the problem of predicting\ntypes as a classification problem and train a recurrent, LSTM-based neural model that, after learning\nfrom an annotatedcode base, predicts function types for unannotated code. We evaluate the \napproach with a corpus of 162,673 JavaScript files from real-world projects. \nNL2Type predicts types with aprecision of 84.1% and a recall of 78.9% when considering only\nthe top-most suggestion, and with a precision of 95.5% and arecall of 89.6% when\nconsidering the top-5 suggestions. The\napproach outperforms both JSNice, a state-of-the-art approach that analyzes implementations \nof functions instead of natural language information, and DeepTyper, a recent type prediction\napproach that is also based on deep learning. Beyond predicting types, NL2Type serves as a\nconsistency checker for existing type annotations. We show that it discovers 39 inconsistencies\nthat deserve developer attention (from a manual analysis of 50 warnings), most of which \nare due to incorrect type annotations.

\n", "tags": ["bimodal", "types"], "tsne_embedding": [-2.5574936866760254, 26.838871002197266]}, {"key": "mammadli2020static", "year": "2020", "title": "Static Neural Compiler Optimization via Deep Reinforcement Learning", "abstract": "

The phase-ordering problem of modern compilers has received a lot of attention from the research community over the years, yet remains largely unsolved. Various optimization sequences exposed to the user are manually designed by compiler developers. In designing such a sequence developers have to choose the set of optimization passes, their parameters and ordering within a sequence. Resulting sequences usually fall short of achieving optimal runtime for a given source code and may sometimes even degrade the performance when compared to unoptimized version. In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem. Provided with sub-sequences constituting LLVM\u2019s O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training and achieves competitive performance on the validation set, gaining up to 1.32x speedup on previously-unseen programs. Notably, our approach differs from autotuning methods by not depending on one or more test runs of the program for making successful optimization decisions. It has no dependence on any dynamic feature, but only on the statically-attainable intermediate representation of the source code. We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents, at first to complement, and eventually replace the hand-crafted optimization sequences.

\n", "tags": ["compilation"], "tsne_embedding": [6.3098225593566895, 10.556711196899414]}, {"key": "mangal2015user", "year": "2015", "title": "A User-Guided Approach to Program Analysis", "abstract": "

Program analysis tools often produce undesirable output\ndue to various approximations. We present an approach\nand a system Eugene that allows user feedback to guide\nsuch approximations towards producing the desired output.\nWe formulate the problem of user-guided program analysis in terms of solving a combination of hard rules and soft\nrules: hard rules capture soundness while soft rules capture\ndegrees of approximations and preferences of users. Our\ntechnique solves the rules using an off-the-shelf solver in a\nmanner that is sound (satisfies all hard rules), optimal (maximally satisfies soft rules), and scales to real-world analy-\nses and programs. We evaluate Eugene on two different\nanalyses with labeled output on a suite of seven Java pro-\ngrams of size 131\u2013198 KLOC. We also report upon a user\nstudy involving nine users who employ Eugene to guide an\ninformation-flow analysis on three Java micro-benchmarks.\nIn our experiments, Eugene significantly reduces misclassified reports upon providing limited amounts of feedback.

\n", "tags": ["program analysis"], "tsne_embedding": [22.891752243041992, 12.036048889160156]}, {"key": "markovtsev2017topic", "year": "2017", "title": "Topic modeling of public repositories at scale using names in source code", "abstract": "

Programming languages themselves have a limited number of reserved keywords and character based tokens that\ndefine the language specification. However, programmers have a rich use of natural language within their code\nthrough comments, text literals and naming entities. The programmer defined names that can be found in source\ncode are a rich source of information to build a high level understanding of the project. The goal of this paper\nis to apply topic modeling to names used in over 13.6 million repositories and perceive the inferred topics.\nOne of the problems in such a study is the occurrence of duplicate repositories not officially marked as forks (obscure forks).\nWe show how to address it using the same identifiers which are extracted for topic modeling.

\n\n

We open with a discussion on naming in source code, we then elaborate on our approach to remove exact duplicate\nand fuzzy duplicate repositories using Locality Sensitive Hashing on the bag-of-words model and then discuss our work\non topic modeling; and finally present the results from our data analysis together with open-access to the source code,\ntools and datasets.

\n", "tags": ["topic modeling", "pattern mining"], "tsne_embedding": [10.757800102233887, -9.122387886047363]}, {"key": "markovtsev2018public", "year": "2018", "title": "Public Git Archive: a Big Code dataset for all", "abstract": "

The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research potential enabled by the available open source code is clearly substantial, no significant large-scale open source code datasets exist. In this paper, we present the Public Git Archive \u2013 dataset of 182,014 top-bookmarked Git repositories from GitHub. We describe the novel data retrieval pipeline to reproduce it. We also elaborate on the strategy for performing dataset updates and legal issues. The Public Git Archive occupies 3.0 TB on disk and is an order of magnitude larger than the current source code datasets. The dataset is made available through HTTP and provides the source code of the projects, the related metadata, and development history. The data retrieval pipeline employs an optimized worker queue model and an optimized archive format to efficiently store forked Git repositories, reducing the amount of data to download and persist. Public Git Archive aims to open a myriad of new opportunities for Big Code research.

\n", "tags": ["dataset"], "tsne_embedding": [6.638616561889648, -6.063177585601807]}, {"key": "markovtsev2019style", "year": "2019", "title": "STYLE-ANALYZER: fixing code style inconsistencies with interpretable unsupervised algorithms", "abstract": "

Source code reviews are manual, time-consuming, and expensive. Human involvement should be focused on analyzing the most relevant aspects of the program, such as logic and maintainability, rather than amending style, syntax, or formatting defects. Some tools with linting capabilities can format code automatically and report various stylistic violations for supported programming languages. They are based on rules written by domain experts, hence, their configuration is often tedious, and it is impractical for the given set of rules to cover all possible corner cases. Some machine learning-based solutions exist, but they remain uninterpretable black boxes. This paper introduces STYLE-ANALYZER, a new open source tool to automatically fix code formatting violations using the decision tree forest model which adapts to each codebase and is fully unsupervised. STYLE-ANALYZER is built on top of our novel assisted code review framework, Lookout. It accurately mines the formatting style of each analyzed Git repository and expresses the found format patterns with compact human-readable rules. STYLE-ANALYZER can then suggest style inconsistency fixes in the form of code review comments. We evaluate the output quality and practical relevance of STYLE-ANALYZER by demonstrating that it can reproduce the original style with high precision, measured on 19 popular JavaScript projects, and by showing that it yields promising results in fixing real style mistakes. STYLE-ANALYZER includes a web application to visualize how the rules are triggered. We release STYLE-ANALYZER as a reusable and extendable open source software package on GitHub for the benefit of the community.

\n", "tags": ["style"], "tsne_embedding": [-22.413660049438477, -13.134105682373047]}, {"key": "mastropaolo2022using", "year": "2022", "title": "Using Deep Learning to Generate Complete Log Statements", "abstract": "

Logging is a practice widely adopted in several phases of the software lifecycle. For example, during software development log statements allow engineers to verify and debug the system by exposing fine-grained information of the running software. While the benefits of logging are undisputed, taking proper decisions about where to inject log statements, what information to log, and at which log level (e.g., error, warning) is crucial for the logging effectiveness. In this paper, we present LANCE (Log stAtemeNt reCommEnder), the first approach supporting developers in all these decisions. LANCE features a Text-To-Text-Transfer-Transformer (T5) model that has been trained on 6,894,456 Java methods. LANCE takes as input a Java method and injects in it a full log statement, including a human-comprehensible logging message and properly choosing the needed log level and the statement location. Our results show that LANCE is able to (i) properly identify the location in the code where to inject the statement in 65.9% of Java methods requiring it; (ii) selecting the proper log level in 66.2% of cases; and (iii) generate a completely correct log statement including a meaningful logging message in 15.2% of cases.

\n", "tags": ["Transformer", "logging"], "tsne_embedding": [-5.456480026245117, 10.941608428955078]}, {"key": "mehrotra2020modeling", "year": "2020", "title": "Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks", "abstract": "

Code clones are duplicate code fragments that share (nearly) similar syntax or semantics. Code clone detection plays an important role in software maintenance, code refactoring, and reuse. A substantial amount of research has been conducted in the past to detect clones. A majority of these approaches use lexical and syntactic information to detect clones. However, only a few of them target semantic clones. Recently, motivated by the success of deep learning models in other fields, including natural language processing and computer vision, researchers have attempted to adopt deep learning techniques to detect code clones. These approaches use lexical information (tokens) and(or) syntactic structures like abstract syntax trees (ASTs) to detect code clones. However, they do not make sufficient use of the available structural and semantic information hence, limiting their capabilities.

\n\n

This paper addresses the problem of semantic code clone detection using program dependency graphs and geometric neural networks, leveraging the structured syntactic and semantic information. We have developed a prototype tool HOLMES, based on our novel approach, and empirically evaluated it on popular code clone benchmarks. Our results show that HOLMES performs considerably better than the other state-of-the-art tool, TBCCD. We also evaluated HOLMES on unseen projects and performed cross dataset experiments to assess the generalizability of HOLMES. Our results affirm that HOLMES outperforms TBCCD since most of the pairs that HOLMES detected were either undetected or suboptimally reported by TBCCD.

\n", "tags": ["clone", "GNN"], "tsne_embedding": [3.2609245777130127, -7.887806415557861]}, {"key": "menon2013machine", "year": "2013", "title": "A Machine Learning Framework for Programming by Example", "abstract": "

Learning programs is a timely and interesting challenge. In Programming by Example\n(PBE), a system attempts to infer a program\nfrom input and output examples alone, by\nsearching for a composition of some set of\nbase functions. We show how machine learning can be used to speed up this seemingly\nhopeless search problem, by learning weights\nthat relate textual features describing the\nprovided input-output examples to plausible\nsub-components of a program. This generic\nlearning framework lets us address problems\nbeyond the scope of earlier PBE systems.\nExperiments on a prototype implementation\nshow that learning improves search and ranking on a variety of text processing tasks found\non help forums.

\n", "tags": ["code generation"], "tsne_embedding": [-11.653495788574219, 16.389467239379883]}, {"key": "mesbah2019deepdelta", "year": "2019", "title": "DeepDelta: Learning to Repair Compilation Errors", "abstract": "

Programmers spend a substantial amount of time manually repairing\ncode that does not compile. We observe that the repairs for\nany particular error class typically follow a pattern and are highly\nmechanical. We propose a novel approach that automatically learns\nthese patterns with a deep neural network and suggests program\nrepairs for the most costly classes of build-time compilation failures.\nWe describe how we collect all build errors and the human-authored,\nin-progress code changes that cause those failing builds to transition\nto successful builds at Google. We generate an AST diff from the\ntextual code changes and transform it into a domain-specific\nlanguage called Delta that encodes the change that must be made\nto make the code compile. We then feed the compiler diagnostic\ninformation (as source) and the Delta changes that resolved the\ndiagnostic (as target) into a Neural Machine Translation network for\ntraining. For the two most prevalent and costly classes of Java compilation errors,\nnamely missing symbols and mismatched methodsignatures, our system called DeepDelta,\ngenerates the correct repair changes for 19,314 out of 38,788 (50%) of unseen compilation\nerrors. The correct changes are in the top three suggested axes 86% of the time on average.

\n", "tags": ["repair", "edit", "compilation"], "tsne_embedding": [16.027000427246094, -1.2714016437530518]}, {"key": "mir2021manytypes4py", "year": "2021", "title": "ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-based Type Inference", "abstract": "

In this paper, we present ManyTypes4Py, a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5,382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of the duplication bias. To facilitate training and evaluation of ML models, the dataset was split into training, validation and test sets by files. To extract type information from abstract syntax trees (ASTs), a lightweight static analyzer pipeline is developed and accompanied with the dataset. Using this pipeline, the collected Python projects were analyzed and the results of the AST analysis were stored in JSON-formatted files. The ManyTypes4Py dataset is shared on zenodo and its tools are publicly available on GitHub.

\n", "tags": ["dataset", "types"], "tsne_embedding": [-3.1307759284973145, 30.127897262573242]}, {"key": "mir2021type4py", "year": "2021", "title": "Type4Py: Deep Similarity Learning-Based Type Inference for Python", "abstract": "

Dynamic languages, such as Python and Javascript, trade static typing for developer flexibility. While this allegedly enables greater productivity, lack of static typing can cause runtime exceptions, type inconsistencies, and is a major factor for weak IDE support. To alleviate these issues, PEP 484 introduced optional type annotations for Python. As retrofitting types to existing codebases is error-prone and laborious, learning-based approaches have been proposed to enable automatic type annotations based on existing, partially annotated codebases. However, the prediction of rare and user-defined types is still challenging. In this paper, we present Type4Py, a deep similarity learning-based type inference model for Python. We design a hierarchical neural network model that learns to discriminate between types of the same kind and dissimilar types in a high-dimensional space, which results in clusters of types. Nearest neighbor search suggests likely type signatures of given Python functions. The types visible to analyzed modules are surfaced using lightweight dependency analysis. The results of quantitative and qualitative evaluation indicate that Type4Py significantly outperforms state-of-the-art approaches at the type prediction task. Considering the Top-1 prediction, Type4Py obtains 19.33% and 13.49% higher precision than Typilus and TypeWriter, respectively, while utilizing a much bigger vocabulary.

\n", "tags": ["types"], "tsne_embedding": [-3.1428678035736084, 28.9696102142334]}, {"key": "mohajer2023skipanalyzer", "year": "2023", "title": "SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models", "abstract": "

We introduce SkipAnalyzer, a large language model (LLM)-powered tool for static code analysis. SkipAnalyzer has three components: 1) an LLM-based static bug detector that scans source code and reports specific types of bugs, 2) an LLM-based false-positive filter that can identify false-positive bugs in the results of static bug detectors (e.g., the result of step 1) to improve detection accuracy, and 3) an LLM-based patch generator that can generate patches for the detected bugs above. As a proof-of-concept, SkipAnalyzer is built on ChatGPT, which has exhibited outstanding performance in various software engineering tasks. To evaluate SkipAnalyzer, we focus on two types of typical and critical bugs that are targeted by static bug detection, i.e., Null Dereference and Resource Leak as subjects. We employ Infer to aid the gathering of these two bug types from 10 open-source projects. Consequently, our experiment dataset contains 222 instances of Null Dereference bugs and 46 instances of Resource Leak bugs. Our study demonstrates that SkipAnalyzer achieves remarkable performance in the mentioned static analysis tasks, including bug detection, false-positive warning removal, and bug repair. In static bug detection, SkipAnalyzer achieves accuracy values of up to 68.37% for detecting Null Dereference bugs and 76.95% for detecting Resource Leak bugs, improving the precision of the current leading bug detector, Infer, by 12.86% and 43.13%, respectively. For removing false-positive warnings, SkipAnalyzer can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for Resource Leak bugs. Additionally, SkipAnalyzer surpasses state-of-the-art false-positive warning removal tools. Furthermore, in bug repair, SkipAnalyzer can generate syntactically correct patches to fix its detected bugs with a success rate of up to 97.30%.

\n", "tags": ["repair"], "tsne_embedding": [18.988479614257812, 9.034433364868164]}, {"key": "monperrus2021megadiff", "year": "2021", "title": "Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size", "abstract": "

This paper presents Megadiff, a dataset of source code diffs. It focuses on Java, with strict inclusion criteria based on commit message and diff size. Megadiff contains 663 029 Java diffs that can be used for research on commit comprehension, fault localization, automated program repair, and machine learning on code changes.

\n", "tags": ["dataset", "edit"], "tsne_embedding": [19.32708740234375, -9.703049659729004]}, {"key": "mou2014building", "year": "2014", "title": "Building Program Vector Representations for Deep Learning", "abstract": "

Deep learning has made significant breakthroughs\nin various fields of artificial intelligence. Advantages of deep\nlearning include the ability to capture highly complicated features, weak involvement of human engineering, etc. However,\nit is still virtually impossible to use deep learning to analyze\nprograms since deep architectures cannot be trained effectively\nwith pure back propagation. In this pioneering paper, we propose\nthe \u201ccoding criterion\u201d to build program vector representations,\nwhich are the premise of deep learning for program analysis. Our\nrepresentation learning approach directly makes deep learning a\nreality in this new field. We evaluate the learned vector representations both qualitatively and quantitatively. We conclude, based\non the experiments, the coding criterion is successful in building\nprogram representations. To evaluate whether deep learning\nis beneficial for program analysis, we feed the representations\nto deep neural networks, and achieve higher accuracy in the\nprogram classification task than \u201cshallow\u201d methods, such as\nlogistic regression and the support vector machine. This result\nconfirms the feasibility of deep learning to analyze programs. It\nalso gives primary evidence of its success in this new field. We\nbelieve deep learning will become an outstanding technique for\nprogram analysis in the near future.

\n\n", "tags": ["representation", "grammar"], "tsne_embedding": [3.8599092960357666, 14.600680351257324]}, {"key": "mou2016convolutional", "year": "2016", "title": "Convolutional Neural Networks over Tree Structures for Programming Language Processing", "abstract": "

Programming language processing (similar to natural language processing) is a hot research topic in the field of software engineering; it has also aroused growing interest in the\nartificial intelligence community. However, different from a\nnatural language sentence, a program contains rich, explicit,\nand complicated structural information. Hence, traditional\nNLP models may be inappropriate for programs. In this paper, we propose a novel tree-based convolutional neural network (TBCNN) for programming language processing, in\nwhich a convolution kernel is designed over programs\u2019 abstract syntax trees to capture structural information. TBCNN\nis a generic architecture for programming language processing; our experiments show its effectiveness in two different program analysis tasks: classifying programs according\nto functionality, and detecting code snippets of certain patterns. TBCNN outperforms baseline methods, including several neural models for NLP.

\n", "tags": ["representation", "grammar"], "tsne_embedding": [-6.091647148132324, 13.779674530029297]}, {"key": "movshovitz2013natural", "year": "2013", "title": "Natural Language Models for Predicting Programming Comments", "abstract": "

Statistical language models have successfully been used to describe and analyze\nnatural language documents. Recent work\napplying language models to programming languages is focused on the task\nof predicting code, while mainly ignoring\nthe prediction of programmer comments.\nIn this work, we predict comments from\nJAVA source files of open source projects,\nusing topic models and n-grams, and we\nanalyze the performance of the models\ngiven varying amounts of background data\non the project being predicted. We evaluate models on their comment-completion\ncapability in a setting similar to code completion tools built into standard code\neditors, and show that using a comment\ncompletion tool can save up to 47% of the\ncomment typing.

\n\n", "tags": ["bimodal", "documentation", "summarization"], "tsne_embedding": [-12.481898307800293, -15.966485977172852]}, {"key": "movshovitz2015kb", "year": "2015", "title": "KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts", "abstract": "

Many existing knowledge bases (KBs), including Freebase, Yago, and NELL, rely\non a fixed ontology, given as an input\nto the system, which defines the data to\nbe cataloged in the KB, i.e., a hierarchy of categories and relations between\nthem. The system then extracts facts that\nmatch the predefined ontology. We propose an unsupervised model that jointly\nlearns a latent ontological structure of an\ninput corpus, and identifies facts from the\ncorpus that match the learned structure.\nOur approach combines mixed membership stochastic block models and topic\nmodels to infer a structure by jointly modeling text, a latent concept hierarchy, and\nlatent semantic relationships among the\nentities mentioned in the text. As a case\nstudy, we apply the model to a corpus\nof Web documents from the software domain, and evaluate the accuracy of the various components of the learned ontology.

\n", "tags": ["pattern mining"], "tsne_embedding": [-6.8738555908203125, -22.68680763244629]}, {"key": "muennighoff2023octopack", "year": "2023", "title": "OctoPack: Instruction Tuning Code Large Language Models", "abstract": "

Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with human instructions. We compile CommitPack: 4 terabytes of Git commits across 350 programming languages. We benchmark CommitPack against other natural and synthetic code instructions (xP3x, Self-Instruct, OASST) on the 16B parameter StarCoder model, and achieve state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark (46.2% pass@1). We further introduce HumanEvalPack, expanding the HumanEval benchmark to a total of 3 coding tasks (Code Repair, Code Explanation, Code Synthesis) across 6 languages (Python, JavaScript, Java, Go, C++, Rust). Our models, OctoCoder and OctoGeeX, achieve the best performance across HumanEvalPack among all permissive models, demonstrating CommitPack\u2019s benefits in generalizing to a wider set of languages and natural coding tasks. Code, models and data are freely available at https://github.com/bigcode-project/octopack.

\n", "tags": ["dataset", "instruction tuning"], "tsne_embedding": [1.1003156900405884, 3.093747854232788]}, {"key": "mukherjee2020searching", "year": "2020", "title": "Searching a Database of Source Codes Using Contextualized Code Search", "abstract": "

We assume a database containing a large set of program source codes and consider the problem of contextualized code search over that database. A programmer has written some part of a program, but has left part of the program (such as a method or a function body) incomplete. The goal is to use the context surrounding the missing code to automatically \u2018figure out\u2019 which of the codes in the database would be useful to the programmer in order to help complete the missing code, in the sense that the programmer could either re-purpose the retrieved code and use the re-purposed code to fill the missing spot in the program. Or, the user could use the retrieved code as a model for implementing the missing code. The search is \u2018contextualized\u2019 in the sense that the search engine should use clues in the partially-completed code to figure out which database code is most useful. The user should not be required to formulate an explicit query.

\n\n

We cast contextualized code search as a learning problem, where the goal is to learn a distribution function computing the likelihood that each database code completes the program, and propose a neural model for predicting which database code is likely to be most useful. Because it will be prohibitively expensive to apply a neural model to each code in a database of millions or billions of codes at search time, one of our key technical concerns is ensuring a speedy search. We address this by learning a \u2018reverse encoder\u2019 that can be used to reduce the problem of evaluating each database code to computing a convolution of two normal distributions, making it possible to search a large database of codes in a reasonable time.

\n", "tags": ["search", "representation"], "tsne_embedding": [-1.9574429988861084, -17.86488151550293]}, {"key": "mukherjee2021neural", "year": "2021", "title": "Neural Program Generation Modulo Static Analysis", "abstract": "

State-of-the-art neural models of source code tend to be evaluated on the generation\nof individual expressions and lines of code, and commonly fail on long-horizon\ntasks such as the generation of entire method bodies. We propose to address this\ndeficiency using weak supervision from a static program analyzer. Our neurosymbolic method allows a deep generative model to symbolically compute, using calls\nto a static-analysis tool, long-distance semantic relationships in the code that it\nhas already generated. During training, the model observes these relationships\nand learns to generate programs conditioned on them. We apply our approach to\nthe problem of generating entire Java methods given the remainder of the class\nthat contains the method. Our experiments show that the approach substantially\noutperforms state-of-the-art transformers and a model that explicitly tries to learn\nprogram semantics on this task, both in terms of producing programs free of basic\nsemantic errors and in terms of syntactically matching the ground truth.

\n", "tags": ["synthesis", "language model"], "tsne_embedding": [9.765189170837402, 5.726444721221924]}, {"key": "murali2017bayesian", "year": "2018", "title": "Bayesian Sketch Learning for Program Synthesis", "abstract": "

We present a Bayesian statistical approach to the problem of automatic program synthesis. Our synthesizer starts\nby learning, offline and from an existing corpus, a probabilistic model of real-world programs. During synthesis,\nit is provided some ambiguous and incomplete evidence about the nature of the programming task that the user\nwants automated, for example sets of API calls or data types that are relevant for the task. Given this input, the\nsynthesizer infers a posterior distribution over type-safe programs that assigns higher likelihood to programs\nthat, according to the learned model, are more likely to match the evidence.

\n\n

We realize this approach using two key ideas. First, our learning techniques operate not over code but\nsyntactic abstractions, or sketches, of programs. During synthesis, we infer a posterior distribution over sketches,\nthen concretize samples from this distribution into type-safe programs using combinatorial techniques. Second,\nour statistical model explicitly models the full intent behind a synthesis task as a latent variable. To infer\nsketches, we first estimate a posterior distribution on the intent, then use samples from this posterior to generate\na distribution over possible sketches. We show that our model can be implemented effectively using the new\nneural architecture of Bayesian encoder-decoders, which can be trained with stochastic gradient descent and\nyields a simple inference procedure.

\n\n

We implement our ideas in a system, called BAYOU , for the synthesis of API-heavy Java methods. We train\nBAYOU on a large corpus of Android apps, and find that the trained system can often synthesize complex\nmethods given just a few API method names or data types as evidence. The experiments also justify the design\nchoice of using a latent intent variable and the levels of abstraction at which sketches and evidence are defined.

\n", "tags": ["code generation", "API"], "tsne_embedding": [7.878736972808838, 6.705506801605225]}, {"key": "murali2017finding", "year": "2017", "title": "Finding Likely Errors with Bayesian Specifications", "abstract": "

We present a Bayesian framework for learning probabilistic specifications from large, unstructured code corpora, and\na method to use this framework to statically detect anomalous, hence likely buggy, program behavior. The distinctive\ninsight here is to build a statistical model that correlates all\nspecifications hidden inside a corpus with the syntax and\nobserved behavior of programs that implement these specifications. During the analysis of a particular program, this\nmodel is conditioned into a posterior distribution that prioritizes specifications that are relevant to this program. This\nallows accurate program analysis even if the corpus is highly\nheterogeneous. The problem of finding anomalies is now\nframed quantitatively, as a problem of computing a distance\nbetween a \u201creference distribution\u201d over program behaviors\nthat our model expects from the program, and the distribution over behaviors that the program actually produces.

\n\n

We present a concrete embodiment of our framework that\ncombines a topic model and a neural network model to learn\nspecifications, and queries the learned models to compute\nanomaly scores. We evaluate this implementation on the\ntask of detecting anomalous usage of Android APIs. Our\nencouraging experimental results show that the method can\nautomatically discover subtle errors in Android applications\nin the wild, and has high precision and recall compared to\ncompeting probabilistic approaches.

\n", "tags": ["program analysis", "API"], "tsne_embedding": [23.70669937133789, 13.708314895629883]}, {"key": "nadeem2022codedsi", "year": "2022", "title": "CodeDSI: Differentiable Code Search", "abstract": "

Reimplementing solutions to previously solved software engineering problems is not only inefficient but also introduces inadequate and error-prone code. Many existing methods achieve impressive performance on this issue by using autoregressive text-generation models trained on code. However, these methods are not without their flaws. The generated code from these models can be buggy, lack documentation, and introduce vulnerabilities that may go unnoticed by developers. An alternative to code generation \u2013 neural code search \u2013 is a field of machine learning where a model takes natural language queries as input and, in turn, relevant code samples from a database are returned. Due to the nature of this pre-existing database, code samples can be documented, tested, licensed, and checked for vulnerabilities before being used by developers in production. In this work, we present CodeDSI, an end-to-end unified approach to code search. CodeDSI is trained to directly map natural language queries to their respective code samples, which can be retrieved later. In an effort to improve the performance of code search, we have investigated docid representation strategies, impact of tokenization on docid structure, and dataset sizes on overall code search performance. Our results demonstrate CodeDSI strong performance, exceeding conventional robust baselines by 2-6% across varying dataset sizes.

\n", "tags": ["search"], "tsne_embedding": [-4.072489261627197, -15.28984546661377]}, {"key": "naik2022probing", "year": "2022", "title": "Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis", "abstract": "

Representational Similarity Analysis is a method from cognitive neuroscience, which helps in comparing representations from two different sources of data. In this paper, we propose using Representational Similarity Analysis to probe the semantic grounding in language models of code. We probe representations from the CodeBERT model for semantic grounding by using the data from the IBM CodeNet dataset. Through our experiments, we show that current pre-training methods do not induce semantic grounding in language models of code, and instead focus on optimizing form-based patterns. We also show that even a little amount of fine-tuning on semantically relevant tasks increases the semantic grounding in CodeBERT significantly. Our ablations with the input modality to the CodeBERT model show that using bimodal inputs (code and natural language) over unimodal inputs (only code) gives better semantic grounding and sample efficiency during semantic fine-tuning. Finally, our experiments with semantic perturbations in code reveal that CodeBERT is able to robustly distinguish between semantically correct and incorrect code.

\n", "tags": ["interpretability", "language model", "evaluation", "Transformer"], "tsne_embedding": [-4.601995944976807, -6.317300319671631]}, {"key": "nair2020funcgnn", "year": "2020", "title": "funcGNN: A Graph Neural Network Approach to Program Similarity", "abstract": "

Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. We introduce funcGNN, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between high-level language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (0.00194), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with the state of the art methods. funcGNN posses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions.

\n", "tags": ["GNN", "clone"], "tsne_embedding": [2.8993752002716064, -6.66222620010376]}, {"key": "nguyen2013lexical", "year": "2013", "title": "Lexical Statistical Machine Translation for Language Migration", "abstract": "

Prior research has shown that source code also exhibits naturalness, i.e. it is written by humans and is likely to be\nrepetitive. The researchers also showed that the n-gram language model is useful in predicting the next token in a source\nfile given a large corpus of existing source code. In this paper, we investigate how well statistical machine translation\n(SMT) models for natural languages could help in migrating source code from one programming language to another.\nWe treat source code as a sequence of lexical tokens and\napply a phrase-based SMT model on the lexemes of those\ntokens. Our empirical evaluation on migrating two Java\nprojects into C# showed that lexical, phrase-based SMT\ncould achieve high lexical translation accuracy ( BLEU from\n81.3-82.6%). Users would have to manually edit only 11.9-15.8% of the total number of tokens in the resulting code to\ncorrect it. However, a high percentage of total translation\nmethods (49.5-58.6%) is syntactically incorrect. Therefore,\nour result calls for a more program-oriented SMT model that\nis capable of better integrating the syntactic and semantic\ninformation of a program to support language migration.

\n", "tags": ["migration", "API"], "tsne_embedding": [4.283272743225098, -21.722888946533203]}, {"key": "nguyen2013statistical", "year": "2013", "title": "A Statistical Semantic Language Model for Source Code", "abstract": "

Recent research has successfully applied the statistical n-gram language model to show that source code exhibits a\ngood level of repetition. The n-gram model is shown to have\ngood predictability in supporting code suggestion and completion. However, the state-of-the-art n-gram approach to\ncapture source code regularities/patterns is based only on\nthe lexical information in a local context of the code units.\nTo improve predictability, we introduce SLAMC, a novel statistical semantic language model for source code. It incorporates semantic information into code tokens and models the\nregularities/patterns of such semantic annotations, called sememes, rather than their lexemes. It combines the local context in semantic n-grams with the global technical concerns/functionality into an n-gram topic model, together with pairwise associations of program elements. Based on SLAMC,\nwe developed a new code suggestion method, which is empirically evaluated on several projects to have relatively 18\u201368%\nhigher accuracy than the state-of-the-art approach.

\n\n", "tags": ["language model"], "tsne_embedding": [-11.268786430358887, -18.115978240966797]}, {"key": "nguyen2013study", "year": "2013", "title": "A Study of Repetitiveness of Code Changes in Software Evolution", "abstract": "

In this paper, we present a large-scale study of\nrepetitiveness of code changes in software evolution. We collected\na large data set of 2,841 Java projects, with 1.7 billion source lines\nof code (SLOC) at the latest revisions, 1.8 million code change\nrevisions (0.4 million fixes), 6.2 million changed files, and 2.5\nbillion changed SLOCs. A change is considered repeated within\nor cross-project if it matches another change having occurred\nin the history of the project or another project, respectively. We\nreport the following important findings. First, repetitiveness of\nchanges could be as high as 70\u2013100% at small sizes and decreases\nexponentially as size increases. Second, repetitiveness is higher\nand more stable in the cross-project setting than in the project-within one. Third, fixing changes repeat similarly to general\nchanges. Importantly, learning code changes and recommending\nthem in software evolution is beneficial with accuracy for top-1\nrecommendation of over 30% and top-3 of nearly 35%. Repeated\nfixing changes could also be useful for automatic program repair.

\n\n", "tags": ["edit"], "tsne_embedding": [18.73570442199707, -10.29053020477295]}, {"key": "nguyen2014statistical", "year": "2014", "title": "Statistical Learning Approach for Mining API Usage Mappings for Code Migration", "abstract": "

The same software product nowadays could appear in multiple platforms and devices. To address business needs, software companies\ndevelop a software product in a programming language and then\nmigrate it to another one. To support that process, semi-automatic\nmigration tools have been proposed. However, they require users\nto manually define the mappings between the respective APIs of\nthe libraries used in two languages. To reduce such manual effort,\nwe introduce StaMiner, a novel data-driven approach that statistically learns the mappings between APIs from the corpus of the\ncorresponding client code of the APIs in two languages Java and\nC#. Instead of using heuristics on the textual or structural similarity\nbetween APIs in two languages to map API methods and classes\nas in existing mining approaches, StaMiner is based on a statistical\nmodel that learns the mappings in such a corpus and provides mappings for APIs with all possible arities. Our empirical evaluation\non several projects shows that StaMiner can detect API usage mappings with higher accuracy than a state-of-the-art approach. With\nthe resulting API mappings mined by StaMiner, Java2CSharp, an\nexisting migration tool, could achieve a higher level of accuracy.

\n", "tags": ["migration", "API"], "tsne_embedding": [6.9685845375061035, -18.484832763671875]}, {"key": "nguyen2015divide", "year": "2014", "title": "Divide-and-Conquer Approach for Multi-phase Statistical Migration for Source Code", "abstract": "

Prior research shows that directly applying phrase-based SMT on lexical tokens to migrate Java to C# produces\nmuch semantically incorrect code. A key limitation is the use of\nsequences in phrase-based SMT to model and translate source\ncode with well-formed structures. We propose mppSMT, a divideand-conquer technique to address that with novel training and migration algorithms using phrase-based SMT in three phases. First,\nmppSMT treats a program as a sequence of syntactic units and\nmaps/translates such sequences in two languages to one another.\nSecond, with a syntax-directed fashion, it deals with the tokens\nwithin syntactic units by encoding them with semantic symbols to\nrepresent their data and token types. This encoding via semantic\nsymbols helps better migration of API usages. Third, the lexical\ntokens corresponding to each sememe are mapped or migrated.\nThe resulting sequences of tokens are merged together to form\nthe final migrated code. Such divide-and-conquer and syntax-direction strategies enable phrase-based SMT to adapt well to\nsyntactical structures in source code, thus, improving migration\naccuracy. Our empirical evaluation on several real-world systems\nshows that 84.8\u201397.9% and 70\u201383% of the migrated methods are\nsyntactically and semantically correct, respectively. 26.3\u201351.2%\nof total migrated methods are exactly matched to the human-written C# code in the oracle. Compared to Java2CSharp, a rule-based migration tool, it achieves higher semantic accuracy from\n6.6\u201357.7% relatively. Importantly, it does not require manual\nlabeling for training data or manual definition of rules.

\n", "tags": ["migration"], "tsne_embedding": [4.813498020172119, -21.195865631103516]}, {"key": "nguyen2015graph", "year": "2015", "title": "Graph-based Statistical Language Model for Code", "abstract": "

n-gram statistical language model has been successfully applied to capture programming patterns to support code\ncompletion and suggestion. However, the approaches using n-gram face challenges in capturing the patterns at higher levels\nof abstraction due to the mismatch between the sequence nature\nin n-grams and the structure nature of syntax and semantics\nin source code. This paper presents GraLan, a graph-based\nstatistical language model and its application in code suggestion. GraLan can learn from a source code corpus and compute\nthe appearance probabilities of any graphs given the observed\n(sub)graphs. We use GraLan to develop an API suggestion\nengine and an AST-based language model, ASTLan. ASTLan\nsupports the suggestion of the next valid syntactic template\nand the detection of common syntactic templates. Our empirical\nevaluation on a large corpus of open-source projects has shown\nthat our engine is more accurate in API code suggestion than\nthe state-of-the-art approaches, and in 75% of the cases, it can\ncorrectly suggest the API with only five candidates. ASTLan also\nhas high accuracy in suggesting the next syntactic template and\nis able to detect many useful and common syntactic templates.

\n", "tags": ["representation", "language model", "autocomplete"], "tsne_embedding": [-13.237197875976562, -16.87152862548828]}, {"key": "nguyen2016learning", "year": "2016", "title": "Learning API Usages from Bytecode: A Statistical Approach", "abstract": "

Mobile app developers rely heavily on standard API frameworks and libraries. However, learning API usages is often challenging due to the fast-changing nature of API frameworks for mobile systems and the insufficiency of API documentation and source code examples. In this paper, we propose a novel approach to learn API usages from bytecode of Android mobile apps. Our core contributions include HAPI, a statistical model of API usages and three algorithms to extract method call sequences from apps\u2019 bytecode, to train HAPI based on those sequences, and to recommend method calls in code completion using the trained HAPIs. Our empirical evaluation shows that our prototype tool can effectively learn API usages from 200 thousand apps containing 350 million method sequences. It recommends next method calls with top-3 accuracy of 90% and outperforms baseline approaches on average 10-20%.

\n", "tags": ["representation", "API"], "tsne_embedding": [8.459501266479492, -19.709064483642578]}, {"key": "nguyen2016mapping", "year": "2016", "title": "Mapping API Elements for Code Migration with Vector Representations", "abstract": "

Mapping API elements has a significant role in software development, especially in code migration. A manual process of defining the migration is tedious and error-prone while recent approaches to automatically mine API mappings are limited to discover the mappings with textually similar APIs\u2019 names. This leads to the low accuracy in existing migration tools.We propose an approach to automatically mine API mappings which overcomes the lexical mismatch problem. We represent an API by its usages instead of its name.To characterize an API with its context consisting of surrounding APIs in its usages, we take advantage of Word2Vec model to project the APIs of Java JDK and C# .NET into corresponding continuous vector spaces. The semantic relations among APIs will be observed in those continuous space as the geometric arrangements between their representation vectors in two vector spaces.We use a learning approach to derive the linear (e.g., rotating and scaling) transformation function between two vector spaces. Transformation function is trained from human-defined pairs of API mappings from Java to C#. To find the C# API mapping with a given Java API, we use the learned function to compute its transformed vector in the C# vector space. Then, the C# API which has the most similar vector with the transformed vector is considered as the result. Our experiment shows that for just one suggestion, we are able to correctly derive the API in C# in almost 43% of the cases. With 5 suggestions, we can correctly suggest the correct C# API in almost 3 out of 4 cases (73.2%).

\n", "tags": ["migration", "API"], "tsne_embedding": [6.028432369232178, -18.401092529296875]}, {"key": "nguyen2017exploring", "year": "2017", "title": "Exploring API Embedding for API Usages and Applications", "abstract": "

Word2Vec is a class of neural network models that\nas being trained from a large corpus of texts, they can produce for\neach unique word a corresponding vector in a continuous space in\nwhich linguistic contexts of words can be observed. In this work,\nwe study the characteristics of Word2Vec vectors, called API 2 VEC\nor API embeddings, for the API elements within the API sequences in source code. Our empirical study shows that the close\nproximity of the API 2 VEC vectors for API elements reflects the\nsimilar usage contexts containing the surrounding APIs of those\nAPI elements. Moreover, API 2 VEC can capture several similar\nsemantic relations between API elements in API usages via vector\noffsets. We demonstrate the usefulness of API 2 VEC vectors for\nAPI elements in three applications. First, we build a tool that mines the pairs of API elements that share the same usage relations\namong them. The other applications are in the code migration\ndomain. We develop API 2 API , a tool to automatically learn the\nAPI mappings between Java and C# using a characteristic of the\nAPI 2 VEC vectors for API elements in the two languages: semantic\nrelations among API elements in their usages are observed in the\ntwo vector spaces for the two languages as similar geometric\narrangements among their API 2 VEC vectors. Our empirical\nevaluation shows that API 2 API relatively improves 22.6% and\n40.1% top-1 and top-5 accuracy over a state-of-the-art mining\napproach for API mappings. Finally, as another application in\ncode migration, we are able to migrate equivalent API usages\nfrom Java to C# with up to 90.6% recall and 87.2% precision.

\n", "tags": ["API", "representation"], "tsne_embedding": [5.934948921203613, -17.663711547851562]}, {"key": "nguyen2019graph", "year": "2019", "title": "Graph-based Mining of In-the-Wild, Fine-grained, Semantic Code Change Patterns", "abstract": "

Existing approaches for detecting repetitive code changes relying on syntactic similarity cannot effectively detect semantic change patterns. In this work, we introduce a novel graph-based mining approach, CPatMiner, which is capable of detecting semantic code change patterns from a large number of open-source repositories by capturing dependencies between fine-grained change elements. We evaluated CPatMiner by mining change patterns in a diverse corpus of 5,000+ open-source projects from GitHub with 170,000+ developers. We use three complementary methods. First, we sent the mined patterns to the authors and received 108 responses. 70% of respondents recognized those patterns as their meaningful frequent changes. 79% of respondents even named the patterns, and 44% wanted IDEs to automate such repetitive changes. The mined patterns belong to various activities: adaptive (9%), perfective (20%), corrective (35%) and preventive (36%). Second, we compared CPatMiner with the state-of-the-art, AST-based technique, and reported that CPatMiner detects 2.1x more meaningful patterns. Third, we used CPatMiner to search for patterns in a corpus of 88 GitHub projects with longer histories consisting of 164M SLOCs. It constructed 322K fine-grained change graphs containing 3M nodes, and detected 17K change patterns which provide unique insights on the practice of change patterns among individuals and teams. We found that a large percentage (75%) of the patterns from individual developers are commonly shared with others, and this holds true for teams. Moreover, we found that the patterns spread widely over time. Thus, we call for a community-based change pattern database to provide important resources in novel applications.

\n", "tags": ["edit", "pattern mining"], "tsne_embedding": [18.4116268157959, -10.740983963012695]}, {"key": "nguyen2020suggesting", "year": "2020", "title": "Suggesting Natural Method Names to Check Name Consistencies", "abstract": "

Misleading names of the methods in a project or the APIs in a software library confuse developers about program functionality\nand API usages, leading to API misuses and defects. In this paper,we introduce MNire, a machine learning approach to check the\nconsistency between the name of a given method and its implementation. MNire first generates a candidate name and compares the\ncurrent name against it. If the two names are sufficiently similar, we consider the method as consistent. To generate the method name,\nwe draw our ideas and intuition from an empirical study on the nature of method names in a large dataset. Our key finding is that\nhigh proportions of the tokens of method names can be found in the three contexts of a given method including its body,\nthe interface (the method\u2019s parameter types and return type), and the enclosing class\u2019 name. Even when such tokens are not there,\nMNire uses the contexts to predict the tokens due to the high likelihoods of their co-occurrences. Our unique idea is to treat\nthe name generation as an abstract summarization on the tokens collected from the names of the program entities in the three\nabove contexts.

\n\n

We conducted several experiments to evaluate MNire in method name consistency checking and in method name\nrecommending on large datasets with +14M methods. In detecting inconsistency method names, MNire improves the state-of-the-art\napproach by 10.4% and 11% relatively in recall and precision, respectively. In method name recommendation, MNire improves relatively\nover the state-of-the-art technique, code2vec, in both recall (18.2% higher) and precision (11.1% higher). To assess MNire\u2019s usefulness,\nwe used it to detect inconsistent methods and suggest new names in several active, GitHub projects. We made 50 pull requests (PRs) and received\n42 responses. Among them, five PRs were merged into the main branch, and 13 were approved for later merging. In total, in 31/42 cases,\nthe developer teams agree that our suggested names are more meaningful than the current names, showing MNire\u2019s usefulness.

\n", "tags": ["naming"], "tsne_embedding": [12.665807723999023, -8.215391159057617]}, {"key": "nie2021evaluation", "year": "2021", "title": "Impact of Evaluation Methodologies on Code Summarization", "abstract": "

There has been a growing interest in developing machine learning (ML) models for code summarization tasks, e.g., comment generation and method naming. Despite substantial increase in the effectiveness of ML models, the evaluation methodologies, i.e., the way people split datasets into training, validation, and test sets, were not well studied. Specifically, no prior work on code summarization considered the timestamps of code and comments during evaluation. This may lead to evaluations that are inconsistent with the intended use cases. In this paper, we introduce the time-segmented evaluation methodology, which is novel to the code summarization research community, and compare it with the mixed-project and cross-project methodologies that have been commonly used. Each methodology can be mapped to some use cases, and the time-segmented methodology should be adopted in the evaluation of ML models for code summarization. To assess the impact of methodologies, we collect a dataset of (code, comment) pairs with timestamps to train and evaluate several recent ML models for code summarization. Our experiments show that different methodologies lead to conflicting evaluation results. We invite the community to expand the set of methodologies used in evaluations.

\n", "tags": ["evaluation", "dataset"], "tsne_embedding": [-17.17341423034668, -11.479129791259766]}, {"key": "nijkamp2022conversational", "year": "2022", "title": "A Conversational Paradigm for Program Synthesis", "abstract": "

Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI\u2019s Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.

\n", "tags": ["Transformer", "synthesis"], "tsne_embedding": [5.311823844909668, 3.48262095451355]}, {"key": "nijkamp2023codegen2", "year": "2023", "title": "CodeGen2: Lessons for Training LLMs on Programming and Natural Languages", "abstract": "

Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly.

\n\n

In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and, (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a \u201cfree lunch\u201d hypothesis. For data distributions, the effect of a mixture distribution of programming and natural languages on model performance is explored.

\n\n

We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into four lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and, 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen2

\n", "tags": ["Transformer"], "tsne_embedding": [1.1916695833206177, 5.604220390319824]}, {"key": "nitin2021direct", "year": "2021", "title": "DIRECT : A Transformer-based Model for Decompiled Identifier Renaming", "abstract": "

Decompiling binary executables to high-level code is an important step in reverse engineering scenarios, such as malware analysis and legacy code maintenance. However, the generated high-level code is difficult to understand since the original variable names are lost. In this paper, we leverage transformer models to reconstruct the original variable names from decompiled code. Inherent differences between code and natural language present certain challenges in applying conventional transformer-based architectures to variable name recovery. We propose DIRECT, a novel transformer-based architecture customized specifically for the task at hand. We evaluate our model on a dataset of decompiled functions and find that DIRECT outperforms the previous state-of-the-art model by up to 20%. We also present ablation studies evaluating the impact of each of our modifications. We make the source code of DIRECT available to encourage reproducible research.

\n", "tags": ["Transformer", "decompilation"], "tsne_embedding": [14.794001579284668, 17.546367645263672]}, {"key": "niu2022spt-code", "year": "2022", "title": "SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations", "abstract": "

Recent years have seen the successful application of large pre-trained modelsto code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding theirapplication to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existingpre-training tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstreamt asks. Experimental results demonstrate that SPT-Code achieves state-of-the-artperformance on five code-related downstream tasks after fine-tuning.

\n", "tags": ["Transformer", "representation"], "tsne_embedding": [-2.834887981414795, -2.1342999935150146]}, {"key": "nye2021program", "year": "2021", "title": "Program Synthesis with Large Language Models", "abstract": "

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model\u2019s ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model\u2019s initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.

\n", "tags": ["Transformer", "synthesis"], "tsne_embedding": [5.133862495422363, 3.0422708988189697]}, {"key": "nye2021show", "year": "2021", "title": "Show Your Work: Scratchpads for Intermediate Computation with Language Models", "abstract": "

Large pre-trained language models perform remarkably well on tasks that can be done \u201cin one pass\u201d, such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations \u2013 even in the few-shot regime \u2013 when asked to perform the operation \u201cstep by step\u201d, showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a \u201cscratchpad\u201d. On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.

\n", "tags": ["Transformer", "execution"], "tsne_embedding": [6.061630725860596, 4.436333179473877]}, {"key": "oda2015learning", "year": "2015", "title": "Learning to Generate Pseudo-code from Source Code using Statistical Machine Translation", "abstract": "

Pseudo-code written in natural language can aid\nthe comprehension of source code in unfamiliar programming\nlanguages. However, the great majority of source code has no\ncorresponding pseudo-code, because pseudo-code is redundant\nand laborious to create. If pseudo-code could be generated\nautomatically and instantly from given source code, we could\nallow for on-demand production of pseudo-code without human\neffort. In this paper, we propose a method to automatically\ngenerate pseudo-code from source code, specifically adopting the\nstatistical machine translation (SMT) framework. SMT, which\nwas originally designed to translate between two natural languages, allows us to automatically learn the relationship between\nsource code/pseudo-code pairs, making it possible to create a\npseudo-code generator with less human effort. In experiments,\nwe generated English or Japanese pseudo-code from Python\nstatements using SMT, and find that the generated pseudo-code\nis largely accurate, and aids code understanding.

\n", "tags": ["representation", "bimodal", "grammar"], "tsne_embedding": [2.659187078475952, -22.769615173339844]}, {"key": "oh2015learning", "year": "2015", "title": "Learning a Strategy for Adapting a Program Analysis via Bayesian Optimisation", "abstract": "

Building a cost-effective static analyser for real-world programs is still regarded an art. One key contributor to this\ngrim reputation is the difficulty in balancing the cost and the\nprecision of an analyser. An ideal analyser should be adap-\ntive to a given analysis task, and avoid using techniques that\nunnecessarily improve precision and increase analysis cost.\nHowever, achieving this ideal is highly nontrivial, and it requires a large amount of engineering efforts.

\n\n

In this paper we present a new approach for building\nan adaptive static analyser. In our approach, the analyser\nincludes a sophisticated parameterised strategy that decides, for each part of a given program, whether to apply\na precision-improving technique to that part or not. We\npresent a method for learning a good parameter for such\na strategy from an existing codebase via Bayesian optimisation. The learnt strategy is then used for new, unseen programs. Using our approach, we developed partially flow-\nand context-sensitive variants of a realistic C static analyser.\nThe experimental results demonstrate that using Bayesian\noptimisation is crucial for learning from an existing codebase. Also, they show that among all program queries that\nrequire flow- or context-sensitivity, our partially flow- and\ncontext-sensitive analysis answers the 75% of them, while\nincreasing the analysis cost only by 3.3x of the baseline\nflow- and context-insensitive analysis, rather than 40x or\nmore of the fully sensitive version.

\n", "tags": ["program analysis"], "tsne_embedding": [21.21198081970215, 11.992541313171387]}, {"key": "olausson2023demystifying", "year": "2023", "title": "Demystifying GPT Self-Repair for Code Generation", "abstract": "

Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair \u2013 in which the model debugs and fixes mistakes in its own code \u2013 has recently become a popular way to boost performance in these settings. However, only very limited studies on how and when self-repair works effectively exist in the literature, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4\u2019s ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that the effectiveness of self-repair is only seen in GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains.

\n", "tags": ["repair"], "tsne_embedding": [13.316444396972656, 0.008338917046785355]}, {"key": "omar2013structured", "year": "2013", "title": "Structured Statistical Syntax Tree Prediction", "abstract": "

Statistical models of source code can be used to improve\ncode completion systems, assistive interfaces, and code\ncompression engines. We are developing a statistical model\nwhere programs are represented as syntax trees, rather than\nsimply a stream of tokens. Our model, initially for the Java\nlanguage, combines corpus data with information about syntax, types and the program context. We tested this model\nusing open source code corpuses and find that our model\nis significantly more accurate than the current state of the\nart, providing initial evidence for our claim that combining\nstructural and statistical information is a fruitful strategy.

\n", "tags": ["language model", "grammar"], "tsne_embedding": [-12.853593826293945, -17.06854248046875]}, {"key": "orlanski2021reading", "year": "2021", "title": "Reading StackOverflow Encourages Cheating: Adding Question Text Improves Extractive Code Generation", "abstract": "

Answering a programming question with only its title is difficult as salient contextual information is left out. To address this, we present a corpus of over 40,000 StackOverflow question texts to be used in conjunction with the corresponding intents from the CoNaLa dataset (Yin et al., 2018). Using both the intent and the question body, we use BART to establish a baseline BLEU score of 34.35 for this new task. We then find further improvements of 2.8% by combining the mined CoNaLa data with the labeled data to achieve a 35.32 BLEU score. We then evaluate the prior state-of-the-art CoNaLa models with this additional data. We find that our proposed method of using the body and mined data beats that of the previous state-of-the-art by a 71.96% BLEU score. Finally, we perform ablations that prove that BART is an unsupervised multimodal learner and examine its extractive behavior.

\n", "tags": ["dataset", "Transformer"], "tsne_embedding": [-5.327184677124023, -8.576211929321289]}, {"key": "ott2018deep", "year": "2018", "title": "A Deep Learning Approach to Identifying Source Code in Images and Video", "abstract": "

While substantial progress has been made in mining code on an\nInternet scale, efforts to date have been overwhelmingly focused on\ndata sets where source code is represented natively as text. Large\nvolumes of source code available online and embedded in technical\nvideos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing\napproaches to code extraction and indexing in this environment rely\nheavily on computationally intense optical character recognition.\nTo improve the ease and efficiency of identifying this embedded\ncode, as well as identifying similar code examples, we develop a\ndeep learning solution based on convolutional neural networks and\nautoencoders. Focusing on Java for proof of concept, our technique\nis able to identify the presence of typeset and handwritten source\ncode in thousands of video images with 85.6%-98.6% accuracy based\non syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides\na more scalable basis for video indexing that can be incorporated\ninto existing software search and mining tools.

\n", "tags": ["information extraction"], "tsne_embedding": [-3.8713810443878174, 19.352231979370117]}, {"key": "pandi2020opttyper", "year": "2020", "title": "OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints", "abstract": "

We present a new approach to the type inference problem for dynamic languages. Our goal is to combine logical constraints, that is, deterministic information from a type system, with natural constraints, uncertain information about types from sources like identifier names. To this end, we introduce a framework for probabilistic type inference that combines logic and learning: logical constraints on the types are extracted from the program, and deep learning is applied to predict types from surface-level code properties that are statistically associated, such as variable names. The main insight of our method is to constrain the predictions from the learning procedure to respect the logical constraints, which we achieve by relaxing the logical inference problem of type prediction into a continuous optimisation problem. To evaluate the idea, we built a tool called OptTyper to predict a TypeScript declaration file for a JavaScript library. OptTyper combines a continuous interpretation of logical constraints derived by a simple program transformation and static analysis of the JavaScript code, with natural constraints obtained from a deep learning model, which learns naming conventions for types from a large codebase. We evaluate OptTyper on a data set of 5,800 open-source JavaScript projects that have type annotations in the well-known DefinitelyTyped repository. We find that combining logical and natural constraints yields a large improvement in performance over either kind of information individually, and produces 50% fewer incorrect type predictions than previous approaches.

\n", "tags": ["types", "bimodal"], "tsne_embedding": [-3.0805985927581787, 26.84416961669922]}, {"key": "panthaplackel2020associating", "year": "2020", "title": "Associating Natural Language Comment and Source Code Entities", "abstract": "

Comments are an integral part of software development; they are natural language descriptions associated with source code elements. Understanding explicit associations can be useful in improving code comprehensibility and maintaining the consistency between code and comments. As an initial step towards this larger goal, we address the task of associating entities in Javadoc comments with elements in Java source code. We propose an approach for automatically extracting supervised data using revision histories of open source projects and present a manually annotated evaluation dataset for this task. We develop a binary classifier and a sequence labeling model by crafting a rich feature set which encompasses various aspects of code, comments, and the relationships between them. Experiments show that our systems outperform several baselines learning from the proposed supervision.

\n", "tags": ["dataset", "bimodal"], "tsne_embedding": [-15.174789428710938, -2.191948652267456]}, {"key": "panthaplackel2020copy", "year": "2020", "title": "Copy that! Editing Sequences by Copying Spans", "abstract": "

Neural sequence-to-sequence models are finding increasing use in editing of documents, for example in correcting a text document or repairing source code. In this paper, we argue that common seq2seq models (with a facility to copy single tokens) are not a natural fit for such tasks, as they have to explicitly copy each unchanged token. We present an extension of seq2seq models capable of copying entire spans of the input to the output in one step, greatly reducing the number of decisions required during inference. This extension means that there are now many ways of generating the same output, which we handle by deriving a new objective for training and a variation of beam search for inference that explicitly handle this problem.

\n\n

In our experiments on a range of editing tasks of natural language and source code, we show that our new model consistently outperforms simpler baselines.

\n", "tags": ["edit"], "tsne_embedding": [-11.943350791931152, -1.1401959657669067]}, {"key": "panthaplackel2020deep", "year": "2020", "title": "Deep Just-In-Time Inconsistency Detection Between Comments and Source Code", "abstract": "

Natural language comments convey key aspects of source code such as implementation, usage, and pre- and post-conditions. Failure to update comments accordingly when the corresponding code is modified introduces inconsistencies, which is known to lead to confusion and software bugs. In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code, in order to catch potential inconsistencies just-in-time, i.e., before they are committed to a version control system. To achieve this, we develop a deep-learning approach that learns to correlate a comment with code changes. By evaluating on a large corpus of comment/code pairs spanning various comment types, we show that our model outperforms multiple baselines by significant margins. For extrinsic evaluation, we show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system which can both detect and resolve inconsistent comments based on code changes.

\n", "tags": ["edit", "bimodal", "documentation"], "tsne_embedding": [-15.169079780578613, -0.6792193055152893]}, {"key": "panthaplackel2020learning", "year": "2020", "title": "Learning to Update Natural Language Comments Based on Code Changes", "abstract": "

We formulate the novel task of automatically updating an existing natural language comment based on changes in the body of code it accompanies. We propose an approach that learns to correlate changes across two distinct language representations, to generate a sequence of edits that are applied to the existing comment to reflect the source code modifications. We train and evaluate our model using a dataset that we collected from commit histories of open-source software projects, with each example consisting of a concurrent update to a method and its corresponding comment. We compare our approach against multiple baselines using both automatic metrics and human evaluation. Results reflect the challenge of this task and that our model outperforms baselines with respect to making edits.

\n", "tags": ["bimodal", "edit", "documentation"], "tsne_embedding": [-14.930514335632324, -0.4222463071346283]}, {"key": "panthaplackel2021learning", "year": "2021", "title": "Learning to Describe Solutions for Bug Reports Based on Developer Discussions", "abstract": "

When a software bug is reported, developers engage in a discussion to collaboratively resolve it. While the solution is likely formulated within the discussion, it is often buried in a large amount of text, making it difficult to comprehend, which delays its implementation. To expedite bug resolution, we propose generating a concise natural language description of the solution by synthesizing relevant content within the discussion, which encompasses both natural language and source code. Furthermore, to support generating an informative description during an ongoing discussion, we propose a secondary task of determining when sufficient context about the solution emerges in real-time. We construct a dataset for these tasks with a novel technique for obtaining noisy supervision from repository changes linked to bug reports. We establish baselines for generating solution descriptions, and develop a classifier which makes a prediction following each new utterance on whether or not the necessary context for performing generation is available. Through automated and human evaluation, we find these tasks to form an ideal testbed for complex reasoning in long, bimodal dialogue context.

\n", "tags": ["summarization", "documentation"], "tsne_embedding": [-19.769411087036133, 5.470187664031982]}, {"key": "panthaplackel2022using", "year": "2022", "title": "Using Developer Discussions to Guide Fixing Bugs in Software", "abstract": "

Automatically fixing software bugs is a challenging task. While recent work showed that natural language context is useful in guiding bug-fixing models, the approach required prompting developers to provide this context, which was simulated through commit messages written after the bug-fixing code changes were made. We instead propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for any additional information from developers. For this, we augment standard bug-fixing datasets with bug report discussions. Using these newly compiled datasets, we demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.

\n", "tags": ["Transformer", "repair"], "tsne_embedding": [-19.838638305664062, 5.579164505004883]}, {"key": "parisi2021source", "year": "2021", "title": "Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers", "abstract": "

The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption. Depending on the kernel to be executed, the energy optimal scaling configuration is not trivial. While recent work has focused on general-purpose systems to learn and predict the best execution target in terms of the execution time of a snippet of code or kernel (e.g. offload OpenCL kernel on multicore CPU or GPU), in this work we focus on static compile-time features to assess if they can be successfully used to predict the minimum energy configuration on PULP, an ultra-low-power architecture featuring an on-chip cluster of RISC-V processors. Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation.

\n", "tags": ["optimization", "program analysis"], "tsne_embedding": [2.550934314727783, 19.175148010253906]}, {"key": "parisi2022making", "year": "2022", "title": "Making the Most of Scarce Input Data in Deep Learning-Based Source Code Classification for Heterogeneous Device Mapping", "abstract": "

Despite its relatively recent history, deep learning (DL)-based source code analysis is already a cornerstone in machine learning for compiler optimization. When applied to the classification of pieces of code to identify the best computational unit in a heterogeneous Systems-on-Chip, it can be effective in supporting decisions that a programmer has otherwise to take manually. Several techniques have been proposed exploiting different networks and input information, prominently sequence-based and graph-based representations, complemented by auxiliary information typically related to payload and device configuration. While the accuracy of DL methods strongly depends on the training and test datasets, so far no exhaustive and statistically meaningful analysis has been done on its impact on the results and on how to effectively extract the available information. This is relevant also considering the scarce availability of source code datasets that can be labeled by profiling on heterogeneous compute units. In this article, we first present such a study, which leads us to devise the contribution of code sequences and auxiliary inputs separately. Starting from this analysis, we then demonstrate that by using the normalization of auxiliary information, it is possible to improve state-of-the-art results in terms of accuracy. Finally, we propose a novel approach exploiting Siamese networks that further improve mapping accuracy by increasing the cardinality of the dataset, thus compensating for its relatively small size.

\n", "tags": ["optimization", "program analysis", "static analysis", "language model"], "tsne_embedding": [2.3750646114349365, 17.7185115814209]}, {"key": "parvez2018building", "year": "2018", "title": "Building Language Models for Text with Named Entities", "abstract": "

Text in many domains involves a significant amount of named entities. Predicting the entity names is often challenging\nfor a language model as they appear less\nfrequent on the training corpus. In this\npaper, we propose a novel and effective\napproach to building a discriminative language model which can learn the entity\nnames by leveraging their entity type information. We also introduce two benchmark datasets based on recipes and Java\nprogramming codes, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2%\nbetter perplexity in recipe generation and\n22.06% on code generation than the state-of-the-art language models.

\n", "tags": ["language model"], "tsne_embedding": [-8.18537425994873, -6.6959991455078125]}, {"key": "parvez2021retrieval", "year": "2021", "title": "Retrieval Augmented Code Generation and Summarization", "abstract": "

Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers\u2019 code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER has a couple of uniqueness. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.

\n", "tags": ["Transformer", "summarization", "code generation"], "tsne_embedding": [-12.935811042785645, -10.811586380004883]}, {"key": "pashakhanloo2022codetrek", "year": "2022", "title": "CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation", "abstract": "

Designing a suitable representation for code-reasoning tasks is challenging in aspects such as the kinds of program information to model, how to combine them, and how much context to consider. We propose CodeTrek, a deep learning approach that addresses these challenges by representing codebases as databases that conform to rich relational schemas. The relational representation not only allows CodeTrek to uniformly represent diverse kinds of program information, but also to leverage program-analysis queries to derive new semantic relations, which can be readily incorporated without further architectural engineering. CodeTrek embeds this relational representation using a set of walks that can traverse different relations in an unconstrained fashion, and incorporates all relevant attributes along the way. We evaluate CodeTrek on four diverse and challenging Python tasks: variable misuse, exception prediction, unused definition, and variable shadowing.\nCodeTrek achieves an accuracy of 91%, 63%, 98%, and 94% on these tasks respectively, and outperforms state-of-the-art neural models by 2-19% points.

\n", "tags": ["representation", "variable misuse"], "tsne_embedding": [0.391126811504364, 11.519550323486328]}, {"key": "patil2022exploring", "year": "2022", "title": "Exploring Dimensions of Generalizability and Few-shot Transfer for Text-to-SQL Semantic Parsing", "abstract": "

Existing work on generalization in Text-to-SQL semantic parsing has been restricted to a zero-shot cross-domain setting. In this paper, we introduce Spider-Gen: a Text-to-SQL benchmark to develop a paradigm of transfer learning across distinct dimensions of generalization in Text-to-SQL semantic parsing. The Spider-Gen benchmark focuses on few-shot adaption for Cross-domain, Lexical, and Structural generalization of Text-to-SQL models. Through our experiments with the Spider-Gen dataset, we show that Seq2Seq language models struggle to generalize against change in data distribution, lexical changes in database schema, and changes in SQL query complexity. Our experiments also reveal that performing few-shot fine-tuning helps Text-to-SQL models to generalize across these changes. However, such few-shot adaptation comes with a negative effect on the knowledge learnt during training. Hence, we also explore Parameter-efficient Fine-tuning methods to overcome the limitations of Seq2Seq Text-to-SQL models. We release the Spider-Gen dataset publicly to facilitate further research in generalization and transfer learning across various dimensions in Text-to-SQL semantic parsing.

\n", "tags": ["dataset", "evaluation", "Transformer", "benchmark", "generalizability"], "tsne_embedding": [-20.091995239257812, -18.850696563720703]}, {"key": "patra2016learning", "year": "2016", "title": "Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic, Generative Models of Input Data", "abstract": "

Fuzzing is a popular technique to create test inputs for software that processes structured data. It has been successfully\napplied in various domains, ranging from compilers and interpreters over program analyses to rendering engines, image manipulation tools, and word processors. Existing fuzz\ntesting techniques are tailored for a particular purpose and\nrely on a carefully crafted model of the data to be generated.\nThis paper presents TreeFuzz, a generic approach for generating structured data without an a priori known model. The\nkey idea is to exploit a given corpus of example data to au-\ntomatically infer probabilistic, generative models that create\nnew data with properties similar to the corpus. To support a\nwide range of different properties, TreeFuzz is designed as a\nframework with an extensible set of techniques to infer generative models. We apply the idea to JavaScript programs\nand HTML documents and show that the approach generates mostly valid data for both of them: 96.3% of the generated JavaScript programs are syntactically valid and there are\nonly 2.06 validation errors per kilobyte of generated HTML.\nThe performance of both learning and generation scales linearly w.r.t. the size of the corpus. Using TreeFuzz-generated\nJavaScript programs for differential testing of JavaScript engines exposes various inconsistencies among browsers, including browser bugs and unimplemented language features.

\n", "tags": ["fuzzing"], "tsne_embedding": [18.817276000976562, 15.315947532653809]}, {"key": "patra2021semantic", "year": "2021", "title": "A Semantic Bug Seeding: A Learning-Based Approach for Creating Realistic Bugs", "abstract": "

When working on techniques to address the wide-spread problem\nof software bugs, one often faces the need for a large number of\nrealistic bugs in real-world programs. Such bugs can either help\nevaluate an approach, e.g., in form of a bug benchmark or a suite\nof program mutations, or even help build the technique, e.g., in\nlearning-based bug detection. Because gathering a large number ofreal bugs is difficult,\na common approach is to rely on automatically\nseeded bugs. Prior work seeds bugs based on syntactic transformation patterns,\nwhich often results in unrealistic bugs and typically \ncannot introduce new, application-specific code tokens. This paper\npresents SemSeed, a technique for automatically seeding bugs in\na semantics-aware way. The key idea is to imitate how a given\nreal-world bug would look like in other programs by semantically\nadapting the bug pattern to the local context. To reason about the\nsemantics of pieces of code, our approach builds on learned token embeddings\nthat encode the semantic similarities of identifiers and literals. Our\nevaluation with real-world JavaScript softwares\nhows that the approach effectively reproduces real bugs and clearly\noutperforms a semantics-unaware approach. The seeded bugs are\nuseful as training data for learning-based bug detection, where\nthey significantly improve the bug detection ability. Moreover, we\nshow that SemSeed-created bugs complement existing mutation\ntesting operators, and that our approach is efficient enough to seed\nhundreds of thousands of bugs within an hour.

\n", "tags": ["repair", "edit"], "tsne_embedding": [18.17028045654297, 4.892918586730957]}, {"key": "pearce2021empirical", "year": "2021", "title": "An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions", "abstract": "

There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described `AI pair programmer\u2019, GitHub Copilot, a language model trained over open-source GitHub code. However, code often contains bugs - and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns on the security of Copilot\u2019s code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g. those from MITRE\u2019s \u201cTop 25\u201d list). We explore Copilot\u2019s performance on three distinct code generation axes \u2013 examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, producing 1,692 programs. Of these, we found approximately 40% to be vulnerable.

\n", "tags": ["Transformer", "language model"], "tsne_embedding": [12.163616180419922, 7.028591632843018]}, {"key": "peng2021how", "year": "2021", "title": "How could Neural Networks understand Programs?", "abstract": "

Semantic understanding of programs is a fundamental problem for programming language processing (PLP). Recent works that learn representations of code based on pre-training techniques in NLP have pushed the frontiers in this direction. However, the semantics of PL and NL have essential differences. These being ignored, we believe it is difficult to build a model to better understand programs, by either directly applying off-the-shelf NLP pre-training techniques to the source code, or adding features to the model by the heuristic. In fact, the semantics of a program can be rigorously defined by formal semantics in PL theory. For example, the operational semantics, describes the meaning of a valid program as updating the environment (i.e., the memory address-value function) through fundamental operations, such as memory I/O and conditional branching. Inspired by this, we propose a novel program semantics learning paradigm, that the model should learn from information composed of (1) the representations which align well with the fundamental operations in operational semantics, and (2) the information of environment transition, which is indispensable for program understanding. To validate our proposal, we present a hierarchical Transformer-based pre-training model called OSCAR to better facilitate the understanding of programs. OSCAR learns from intermediate representation (IR) and an encoded representation derived from static analysis, which are used for representing the fundamental operations and approximating the environment transitions respectively. OSCAR empirically shows the outstanding capability of program semantics understanding on many practical software engineering tasks.

\n", "tags": ["Transformer"], "tsne_embedding": [2.5532820224761963, 12.144387245178223]}, {"key": "peng2023generative", "year": "2023", "title": "Generative Type Inference for Python", "abstract": "

Python is a popular dynamic programming language, evidenced by its ranking as the second most commonly used language on GitHub. However, its dynamic type system can lead to potential type errors, leading researchers to explore automatic type inference approaches for Python programs. The rule-based type inference approaches can ensure the accuracy of predicted variable types, but they suffer from low coverage problems. Supervised type inference approaches, while feature-agnostic, require large, high-quality annotated datasets and are limited to pre-defined types. As zero-shot approaches, the cloze-style approaches reformulate the type inference problem into a fill-in-the-blank problem. However, their performance is limited. This paper introduces TypeGen, a few-shot generative type inference approach that incorporates static domain knowledge from static analysis. TypeGen creates chain-of-thought (COT) prompts by translating the type inference steps of static analysis into prompts based on the type dependency graphs (TDGs), enabling language models to learn from how static analysis infers types. By combining COT prompts with code slices and type hints, TypeGen constructs example prompts from human annotations. TypeGen only requires very few annotated examples to teach language models to generate similar COT prompts via in-context learning. Moreover, TypeGen enhances the interpretability of results through the use of the input-explanation-output strategy. Experiments show that TypeGen outperforms the best baseline Type4Py by 10.0% for argument type prediction and 22.5% in return value type prediction in terms of top-1 Exact Match by using only five examples. Furthermore, TypeGen achieves substantial improvements of 27% to 84% compared to the zero-shot performance of large language models with parameter sizes ranging from 1.3B to 175B in terms of top-1 Exact Match.

\n", "tags": ["types"], "tsne_embedding": [-2.2891433238983154, 29.11396026611328]}, {"key": "phan2021cotext", "year": "2021", "title": "CoTexT: Multi-task Learning with Code-Text Transformer", "abstract": "

We present CoTexT, a transformer-based architecture encoder-decoder pre-trained model that learns the representative context between natural language (NL) and programming language (PL) through multi-task learning. CoTexT is pre-trained, in self-supervised fashion, based on large programming language corpus to learn general-purpose understanding and code-text generation supporting downstream NL-PL task such as code summarizing/documentation, code generation, defect detection, code debugging, etc. We train CoTexT on different combination of available PL corpus including both \u201cbimodal\u201d and \u201cunimodal\u201d data where the former is the combinations of both natural texts and their corresponding code snippets in an input sequence and the latter is merely code snippets. We evaluate multi-task learning CoTexT on different generation and classification tasks on CodeXGLUE and it achieves state-of-the-art on all downstream tasks.

\n", "tags": ["Transformer"], "tsne_embedding": [-5.557413101196289, -3.645583391189575]}, {"key": "piech2015learning", "year": "2015", "title": "Learning Program Embeddings to Propagate Feedback on Student Code", "abstract": "

Providing feedback, both assessing final work\nand giving hints to stuck students, is difficult\nfor open-ended assignments in massive online\nclasses which can range from thousands to millions of students. We introduce a neural network\nmethod to encode programs as a linear mapping\nfrom an embedded precondition space to an embedded postcondition space and propose an algorithm for feedback at scale using these linear maps as features. We apply our algorithm\nto assessments from the Code.org Hour of Code\nand Stanford University\u2019s CS1 course, where we\npropagate human comments on student assignments to orders of magnitude more submissions.

\n", "tags": ["representation", "repair", "education"], "tsne_embedding": [-14.286341667175293, 17.674142837524414]}, {"key": "poesia2022synchromesh", "year": "2022", "title": "Synchromesh: Reliable code generation from pre-trained language models", "abstract": "

Large pre-trained language models have been used to generate code,providing a flexible interface for synthesizing programs from natural language specifications. However, they often violate syntactic and semantic rules of their output language, limiting their practical usability. In this paper, we propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation. Synchromesh comprises two components. First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection. TST learns to recognize utterances that describe similar target programs despite differences in surface natural language features. Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD): a general framework for constraining the output to a set of valid programs in the target language. CSD leverages constraints on partial outputs to sample complete correct programs, and needs neither re-training nor fine-tuning of the language model. We evaluate our methods by synthesizing code from natural language descriptions using GPT-3 and Codex in three real-world languages: SQL queries, Vega-Lite visualizations and SMCalFlow programs. These domains showcase rich constraints that CSD is able to enforce, including syntax, scope, typing rules, and contextual logic. We observe substantial complementary gains from CSD and TST in prediction accuracy and in effectively preventing run-time errors.

\n", "tags": ["Transformer", "language model"], "tsne_embedding": [4.827783107757568, 2.2384934425354004]}, {"key": "popov2021time", "year": "2021", "title": "Time-Efficient Code Completion Model for the R Programming Language", "abstract": "

In this paper we present a deep learning code completion model for the R language. We introduce several techniques to utilize language modeling based architecture in the code completion task. With these techniques, the model requires low resources, but still achieves high quality. We also present an evaluation dataset for the R language completion task. Our dataset contains multiple autocompletion usage contexts that provides robust validation results. The dataset is publicly available.

\n", "tags": ["dataset", "language model", "code generation", "Transformer"], "tsne_embedding": [-6.821138858795166, 6.55757474899292]}, {"key": "pradel2017deep", "year": "2017", "title": "Deep Learning to Find Bugs", "abstract": "

Automated bug detection, e.g., through pattern-based static\nanalysis, is an increasingly popular technique to find programming errors and other code quality issues. Traditionally,\nbug detectors are program analyses that are manually written and carefully tuned by an analysis expert. Unfortunately,\nthe huge amount of possible bug patterns makes it difficult\nto cover more than a small fraction of all bugs. This paper\npresents a new approach toward creating bug detectors. The\nbasic idea is to replace manually writing a program analysis\nwith training a machine learning model that distinguishes\nbuggy from non-buggy code. To address the challenge that\neffective learning requires both positive and negative train-\ning examples, we use simple code transformations that create likely incorrect code from existing code examples. We\npresent a general framework, called DeepBugs, that extracts\npositive training examples from a code corpus, leverages\nsimple program transformations to create negative training\nexamples, trains a model to distinguish these two, and then\nuses the trained model for identifying programming mistakes in previously unseen code. As a proof of concept, we\ncreate four bug detectors for JavaScript that find a diverse set\nof programming mistakes, e.g., accidentally swapped function arguments, incorrect assignments, and incorrect binary\noperations. To find bugs, the trained models use information\nthat is usually discarded by program analyses, such as identifier names of variables and functions. Applying the approach\nto a corpus of 150,000 JavaScript files shows that learned bug\ndetectors have a high accuracy, are very efficient, and reveal\n132 programming mistakes in real-world code.

\n\n", "tags": ["defect", "program analysis"], "tsne_embedding": [17.783462524414062, 4.629268646240234]}, {"key": "pradel2019typewriter", "year": "2019", "title": "TypeWriter: Neural Type Prediction with Search-based Validation", "abstract": "

Maintaining large code bases written in dynamically typed languages, such as JavaScript or Python, can be challenging: simple data compatibility errors proliferate, IDE support is lacking and APIs are harder to comprehend. Recent work attempts to address those issues through either static analysis or probabilistic type inference. Unfortunately, static type inference for dynamic languages is inherently limited, while probabilistic approaches suffer from imprecision. This paper presents TypeWriter, the first combination of probabilistic prediction with search-based refinement of predicted types. TypeWriter\u2019s predictor learns to infer the return and argument types for functions from partially annotated code bases by combining the natural language properties of code with programming language-level information. To validate predicted types, TypeWriter invokes a gradual type checker with different combinations of the predicted types, while navigating the space of possible type combinations in a feedback-directed manner. We implement the TypeWriter approach for Python and evaluate it on two code corpora: a multi-million line code base at Facebook and a collection of 500 popular open-source projects. We show that TypeWriter\u2019s type predictor achieves a precision of 64% (91%) and a recall of 52% (68%) in the top-1 (top-5) predictions, and demonstrate that usage contexts are a helpful addition to neural type predictors. By combining predictions with search-based validation, TypeWriter can fully annotate between 42% to 64% of the files in a randomly selected corpus, while ensuring type correctness. A comparison with a static type inference tool shows that TypeWriter adds many more non-trivial types. Overall, TypeWriter provides developers with an effective way to help with the transition to fully type-annotated code.

\n", "tags": ["types", "bimodal"], "tsne_embedding": [-2.500756025314331, 28.28154945373535]}, {"key": "pradel2020neural", "year": "2020", "title": "Neural Software Analysis", "abstract": "

Many software development problems can be addressed by program analysis tools, which traditionally are based on precise, logical reasoning and heuristics to ensure that the tools are practical. Recent work has shown tremendous success through an alternative way of creating developer tools, which we call neural software analysis. The key idea is to train a neural machine learning model on numerous code examples, which, once trained, makes predictions about previously unseen code. In contrast to traditional program analysis, neural software analysis naturally handles fuzzy information, such as coding conventions and natural language embedded in code, without relying on manually encoded heuristics. This article gives an overview of neural software analysis, discusses when to (not) use it, and presents three example analyses. The analyses address challenging software development problems: bug detection, type prediction, and code completion. The resulting tools complement and outperform traditional program analyses, and are used in industrial practice.

\n", "tags": ["program analysis", "survey"], "tsne_embedding": [14.83123779296875, 6.182992935180664]}, {"key": "pravilov2021unsupervised", "year": "2021", "title": "Unsupervised Learning of General-Purpose Embeddings for Code Changes", "abstract": "

Applying machine learning to tasks that operate with code changes requires their numerical representation. In this work, we propose an approach for obtaining such representations during pre-training and evaluate them on two different downstream tasks - applying changes to code and commit message generation. During pre-training, the model learns to apply the given code change in a correct way. This task requires only code changes themselves, which makes it unsupervised. In the task of applying code changes, our model outperforms baseline models by 5.9 percentage points in accuracy. As for the commit message generation, our model demonstrated the same results as supervised models trained for this specific task, which indicates that it can encode code changes well and can be improved in the future by pre-training on a larger dataset of easily gathered code changes.

\n", "tags": ["edit", "representation"], "tsne_embedding": [-8.959426879882812, 0.1838078647851944]}, {"key": "proksch2015intelligent", "year": "2015", "title": "Intelligent Code Completion with Bayesian Networks", "abstract": "

Code completion is an integral part of modern Integrated Development Environments (IDEs). Developers\noften use it to explore Application Programming Interfaces (APIs). It is also useful to reduce the required\namount of typing and to help avoid typos. Traditional code completion systems propose all type-correct\nmethods to the developer. Such a list is often very long with many irrelevant items. More intelligent code\ncompletion systems have been proposed in prior work to reduce the list of proposed methods to relevant\nitems.

\n\n

This work extends one of these existing approaches, the Best Matching Neighbor (BMN) algorithm. We\nintroduce Bayesian networks as an alternative underlying model, use additional context information for\nmore precise recommendations, and apply clustering techniques to improve model sizes. We compare our\nnew approach, Pattern-based Bayesian Networks (PBN), to the existing BMN algorithm. We extend previously used evaluation methodologies and, in addition to prediction quality, we also evaluate model size and\ninference speed.

\n\n

Our results show that the additional context information we collect improves prediction quality, especially\nfor queries that do not contain method calls. We also show that PBN can obtain comparable prediction\nquality to BMN, while model size and inference speed scale better with large input sizes.

\n", "tags": ["autocomplete"], "tsne_embedding": [-7.698126316070557, -18.586570739746094]}, {"key": "pu2016skp", "year": "2016", "title": "sk_p: a neural program corrector for MOOCs", "abstract": "

We present a novel technique for automatic program correction in MOOCs, capable of fixing both syntactic and semantic errors without manual, problem specific correction strategies. Given an incorrect student program, it generates candidate programs from a distribution of likely corrections, and checks each candidate for correctness against a test suite.

\n\n

The key observation is that in MOOCs many programs share similar code fragments, and the seq2seq neural network model, used in the natural-language processing task of machine translation, can be modified and trained to recover these fragments.

\n\n

Experiment shows our scheme can correct 29% of all incorrect submissions and out-performs state of the art approach which requires manual, problem specific correction strategies.

\n", "tags": ["repair"], "tsne_embedding": [20.381088256835938, -4.790850639343262]}, {"key": "puri2021project", "year": "2021", "title": "Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks", "abstract": "

Advancements in deep learning and machine learning algorithms have enabled\nbreakthrough progress in computer vision, speech recognition, natural language\nprocessing and beyond. In addition, over the last several decades, software has\nbeen built into the fabric of every aspect of our society. Together, these two\ntrends have generated new interest in the fast-emerging research area of \u201cAI for\nCode\u201d. As software development becomes ubiquitous across all industries and code\ninfrastructure of enterprise legacy applications ages, it is more critical than ever\nto increase software development productivity and modernize legacy applications.\nOver the last decade, datasets like ImageNet, with its large scale and diversity,\nhave played a pivotal role in algorithmic advancements from computer vision to\nlanguage and speech understanding. In this paper, we present \u201cProject CodeNet\u201d,\na first-of-its-kind, very large scale, diverse, and high-quality dataset to accelerate\nthe algorithmic advancements in AI for Code. It consists of 14M code samples\nand about 500M lines of code in 55 different programming languages. Project\nCodeNet is not only unique in its scale, but also in the diversity of coding tasks\nit can help benchmark: from code similarity and classification for advances in\ncode recommendation algorithms, and code translation between a large variety\nprogramming languages, to advances in code performance (both runtime, and\nmemory) improvement techniques. CodeNet also provides sample input and output\ntest sets for over 7M code samples, which can be critical for determining code\nequivalence in different languages. As a usability feature, we provide several \npreprocessing tools in Project CodeNet to transform source codes into representations\nthat can be readily used as inputs into machine learning models.

\n", "tags": ["dataset"], "tsne_embedding": [-2.503654718399048, 18.223236083984375]}, {"key": "rabin2019testing", "year": "2019", "title": "Testing Neural Program Analyzers", "abstract": "

Deep neural networks have been increasingly used in software engineering and program analysis tasks. They usually take a program and make some predictions about it, e.g., bug prediction. We call these models neural program analyzers. The reliability of neural programs can impact the reliability of the encompassing analyses. In this paper, we describe our ongoing efforts to develop effective techniques for testing neural programs. We discuss the challenges involved in developing such tools and our future plans. In our preliminary experiment on a neural model recently proposed in the literature, we found that the model is very brittle, and simple perturbations in the input can cause the model to make mistakes in its prediction.

\n", "tags": ["evaluation", "refactoring"], "tsne_embedding": [14.908533096313477, 5.496025562286377]}, {"key": "rabin2020demystifying", "year": "2020", "title": "Towards Demystifying Dimensions of Source Code Embeddings", "abstract": "

Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although successful, there is little known about the contents of these vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with the handcrafted features. Our results suggest that the handcrafted features can perform very close to the highly-dimensional code2vec embeddings, and the information gains are more evenly distributed in the code2vec embeddings compared to the handcrafted features. We also find that the code2vec embeddings are more resilient to the removal of dimensions with low information gains than the handcrafted features. We hope our results serve a stepping stone toward principled analysis and evaluation of these code representations.

\n", "tags": ["evaluation", "representation", "naming", "interpretability"], "tsne_embedding": [3.7919323444366455, -12.140810012817383]}, {"key": "rabin2021generalizability", "year": "2021", "title": "On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations", "abstract": "

With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks such as predicting method names in given programs that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees. On the positive side, we observe that as the size of the training dataset grows and diversifies the generalizability of correct predictions produced by the neural program models can be improved too. Our results on the generalizability of neural program models provide insights to measure their limitations and provide a stepping stone for their improvement.

\n", "tags": ["evaluation", "adversarial", "generalizability", "refactoring", "summarization"], "tsne_embedding": [3.1468701362609863, 11.64459228515625]}, {"key": "rabin2021understanding", "year": "2021", "title": "Understanding Neural Code Intelligence Through Program Simplification", "abstract": "

A wide range of code intelligence (CI) tools, powered by deep neural networks, have been developed recently to improve programming productivity and perform program analysis. To reliably use such tools, developers often need to reason about the behavior of the underlying models and the factors that affect them. This is especially challenging for tools backed by deep neural networks. Various methods have tried to reduce this opacity in the vein of \u201ctransparent/interpretable-AI\u201d. However, these approaches are often specific to a particular set of network architectures, even requiring access to the network\u2019s parameters. This makes them difficult to use for the average programmer, which hinders the reliable adoption of neural CI systems. In this paper, we propose a simple, model-agnostic approach to identify critical input features for models in CI systems, by drawing on software debugging research, specifically delta debugging. Our approach, SIVAND, uses simplification techniques that reduce the size of input programs of a CI model while preserving the predictions of the model. We show that this approach yields remarkably small outputs and is broadly applicable across many model architectures and problem domains. We find that the models in our experiments often rely heavily on just a few syntactic features in input programs. We believe that SIVAND\u2019s extracted features may help understand neural CI systems\u2019 predictions and learned behavior.

\n", "tags": ["interpretability", "refactoring", "information extraction"], "tsne_embedding": [8.06714153289795, 12.674334526062012]}, {"key": "rabin2022memorization", "year": "2022", "title": "Memorization and Generalization in Neural Code Intelligence Models", "abstract": "

Deep Neural Networks (DNNs) are increasingly being used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights on how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight important risks: millions of trainable parameters allow the neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed all models manifest some forms of memorization. This can be potentially troublesome in most code intelligence tasks where they rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models in code intelligence systems that are usually overlooked by software engineering researchers.

\n", "tags": ["evaluation", "memorization", "generalizability", "refactoring", "language model"], "tsne_embedding": [-1.563639521598816, 6.0610270500183105]}, {"key": "rabin2022understanding", "year": "2022", "title": "Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models", "abstract": "

Neural code intelligence (CI) models are opaque black-boxes and offer little insight on the features they use in making predictions. This opacity may lead to distrust in their prediction and hamper their wider adoption in safety-critical applications. Recently, input program reduction techniques have been proposed to identify key features in the input programs to improve the transparency of CI models. However, this approach is syntax-unaware and does not consider the grammar of the programming language. In this paper, we apply a syntax-guided program reduction technique that considers the grammar of the input programs during reduction. Our experiments on multiple models across different types of input programs show that the syntax-guided program reduction technique is faster and provides smaller sets of key tokens in reduced programs. We also show that the key tokens could be used in generating adversarial examples for up to 65% of the input programs.

\n", "tags": ["interpretability", "refactoring", "adversarial"], "tsne_embedding": [8.532830238342285, 13.030779838562012]}, {"key": "rabinovich2017abstract", "year": "2017", "title": "Abstract Syntax Networks for Code Generation and Semantic Parsing", "abstract": "

Tasks like code generation and semantic parsing require mapping unstructured (or partially structured) inputs to well-formed, executable outputs. We introduce abstract syntax networks, a modeling framework for these problems. The outputs are represented as abstract syntax trees (ASTs) and constructed by a decoder with a dynamically-determined modular structure paralleling the structure of the output tree. On the benchmark Hearthstone dataset for code generation, our model obtains 79.2 BLEU and 22.7% exact match accuracy, compared to previous state-of-the-art values of 67.1 and 6.1%. Furthermore, we perform competitively on the Atis, Jobs, and Geo semantic parsing datasets with no task-specific engineering.

\n", "tags": ["code generation", "grammar"], "tsne_embedding": [-22.22745132446289, -4.585859775543213]}, {"key": "raghothaman2018user", "year": "2018", "title": "User-guided program reasoning using Bayesian inference", "abstract": "

Program analyses necessarily make approximations that often lead them to report true alarms interspersed with many false alarms. We propose a new approach to leverage user feedback to guide program analyses towards true alarms and away from false alarms. Our approach associates each alarm with a confidence value by performing Bayesian inference on a probabilistic model derived from the analysis rules. In each iteration, the user inspects the alarm with the highest confidence and labels its ground truth, and the approach recomputes the confidences of the remaining alarms given this feedback. It thereby maximizes the return on the effort by the user in inspecting each alarm. We have implemented our approach in a tool named Bingo for program analyses expressed in Datalog. Experiments with real users and two sophisticated analyses\u2014a static datarace analysis for Java programs and a static taint analysis for Android apps\u2014show significant improvements on a range of metrics, including false alarm rates and number of bugs found.

\n", "tags": ["program analysis"], "tsne_embedding": [23.333520889282227, 12.919833183288574]}, {"key": "rahman2019natural", "year": "2019", "title": "Natural Software Revisited", "abstract": "

Recent works have concluded that software is more repetitive and predictable, i.e. more natural, than English texts. These works included \u201csimple/artificial\u201d syntax rules in their language models. When we remove SyntaxTokens we find that code is still repetitive and predictable but only at levels slightly above English. Furthermore, previous works have compared individual Java programs to general English corpora, such as Gutenberg, which contains a historically large range of styles and subjects (e.g. Saint Augustine to Oscar Wilde). We perform an additional comparison of technical StackOverflow English discussions with source code and find that this restricted English is similarly repetitive to code. Although we find that code is less repetitive than previously thought, we suspect that API code element usage will be repetitive across software projects. For example a file is opened and closed in the same manner irrespective of domain. When we restrict our n-grams to those contained in the Java API we find that the entropy is significantly lower than the English corpora. Previous works have focused on sequential sequences of tokens. When we extract program graphs of size 2, 3, and 4 nodes we see that the abstract graph representation is much more concise and repetitive than the sequential representations of the same code. This suggests that future work should focus on statistical graph models that go beyond linear sequences of tokens. Our anonymous replication package makes our scripts and data available to future researchers and reviewers.

\n", "tags": [], "tsne_embedding": [-14.063563346862793, -17.285057067871094]}, {"key": "ramakrishnan2020backdoors", "year": "2022", "title": "Backdoors in Neural Models of Source Code", "abstract": "

Deep neural networks are vulnerable to a range of adversaries. A particularly pernicious class of vulnerabilities are backdoors, where model predictions diverge in the presence of subtle triggers in inputs. An attacker can implant a backdoor by poisoning the training data to yield a desired target prediction on triggered inputs. We study backdoors in the context of deep-learning for source code. (1) We define a range of backdoor classes for source-code tasks and show how to poison a dataset to install such backdoors. (2) We adapt and improve recent algorithms from robust statistics for our setting, showing that backdoors leave a spectral signature in the learned representation of source code, thus enabling detection of poisoned data. (3) We conduct a thorough evaluation on different architectures and languages, showing the ease of injecting backdoors and our ability to eliminate them.

\n", "tags": ["adversarial"], "tsne_embedding": [10.072778701782227, 20.005273818969727]}, {"key": "ray2015naturalness", "year": "2015", "title": "On the \u201cNaturalness\u201d of Buggy Code", "abstract": "

Real software, the kind working programmers produce by the kLOC\nto solve real-world problems, tends to be \u201cnatural\u201d, like speech or\nnatural language; it tends to be highly repetitive and predictable.\nResearchers have captured this naturalness of software through statistical models and used them to good effect in suggestion engines,\nporting tools, coding standards checkers, and idiom miners. This\nsuggests that code that appears improbable, or surprising, to a good\nstatistical language model is \u201cunnatural\u201d in some sense, and thus\npossibly suspicious. In this paper, we investigate this hypothesis. We consider a large corpus of bug fix commits (ca. 8,296),\nfrom 10 different Java projects, and we focus on its language statistics, evaluating the naturalness of buggy code and the corresponding fixes. We find that code with bugs tends to be more entropic\n(i.e. unnatural), becoming less so as bugs are fixed. Focusing on\nhighly entropic lines is similar in cost-effectiveness to some well-known static bug finders (PMD, FindBugs) and ordering warnings\nfrom these bug finders using an entropy measure improves the cost-effectiveness of inspecting code implicated in warnings. This suggests that entropy may be a valid language-independent and simple\nway to complement the effectiveness of PMD or FindBugs, and\nthat search-based bug-fixing methods may benefit from using entropy both for fault-localization and searching for fixes.

\n\n", "tags": ["defect"], "tsne_embedding": [20.399913787841797, 7.593733310699463]}, {"key": "raychev2014code", "year": "2014", "title": "Code Completion with Statistical Language Models", "abstract": "

We address the problem of synthesizing code completions for programs using APIs. Given a program with holes, we synthesize completions for holes with the most likely sequences of method calls.

\n\n

Our main idea is to reduce the problem of code completion to\na natural-language processing problem of predicting probabilities\nof sentences. We design a simple and scalable static analysis that\nextracts sequences of method calls from a large codebase, and\nindex these into a statistical language model. We then employ\nthe language model to find the highest ranked sentences, and use\nthem to synthesize a code completion. Our approach is able to\nsynthesize sequences of calls across multiple objects together with\ntheir arguments.

\n\n

Experiments show that our approach is fast and effective. Virtually all computed completions typecheck, and the desired completion appears in the top 3 results in 90% of the cases.

\n", "tags": ["language model", "autocomplete", "code generation"], "tsne_embedding": [-11.227243423461914, -15.03122615814209]}, {"key": "raychev2015predicting", "year": "2015", "title": "Predicting Program Properties from \u201cBig Code\u201d", "abstract": "

We present a new approach for predicting program properties from\nmassive codebases (aka \u201cBig Code\u201d). Our approach first learns a\nprobabilistic model from existing data and then uses this model to\npredict properties of new, unseen programs.

\n\n

The key idea of our work is to transform the input program into\na representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic\ngraphical models such as conditional random fields (CRFs) in order\nto perform joint prediction of program properties.

\n\n

As an example of our approach, we built a scalable prediction\nengine called JSNICE 1 for solving two kinds of problems in the\ncontext of JavaScript: predicting (syntactic) names of identifiers\nand predicting (semantic) type annotations of variables. Experimentally, JSNICE predicts correct names for 63% of name identifiers and its type annotation predictions are correct in 81% of the\ncases. In the first week since its release, JSN ICE was used by more\nthan 30,000 developers and in only few months has become a popular tool in the JavaScript developer community.

\n\n

By formulating the problem of inferring program properties as\nstructured prediction and showing how to perform both learning\nand inference in this context, our work opens up new possibilities\nfor attacking a wide range of difficult problems in the context of\n\u201cBig Code\u201d including invariant generation, de-compilation, synthesis and others.

\n", "tags": ["program analysis", "naming", "types", "deobfuscation"], "tsne_embedding": [3.6648075580596924, 9.598525047302246]}, {"key": "raychev2016learning", "year": "2016", "title": "Learning Programs from Noisy Data", "abstract": "

We present a new approach for learning programs from noisy\ndatasets. Our approach is based on two new concepts: a regularized\nprogram generator which produces a candidate program based on a\nsmall sample of the entire dataset while avoiding overfitting, and a\ndataset sampler which carefully samples the dataset by leveraging\nthe candidate program\u2019s score on that dataset. The two components\nare connected in a continuous feedback-directed loop.

\n\n

We show how to apply this approach to two settings: one where\nthe dataset has a bound on the noise, and another without a noise\nbound. The second setting leads to a new way of performing\napproximate empirical risk minimization on hypotheses classes\nformed by a discrete search space.

\n\n

We then present two new kinds of program synthesizers which\ntarget the two noise settings. First, we introduce a novel regularized\nbitstream synthesizer that successfully generates programs even in\nthe presence of incorrect examples. We show that the synthesizer\ncan detect errors in the examples while combating overfitting \u2013\na major problem in existing synthesis techniques. We also show\nhow the approach can be used in a setting where the dataset grows\ndynamically via new examples (e.g., provided by a human).

\n\n

Second, we present a novel technique for constructing statistical\ncode completion systems. These are systems trained on massive\ndatasets of open source programs, also known as \u201cBig Code\u201d. The\nkey idea is to introduce a domain specific language (DSL) over\ntrees and to learn functions in that DSL directly from the dataset.\nThese learned functions then condition the predictions made by the\nsystem. This is a flexible and powerful technique which generalizes\nseveral existing works as we no longer need to decide a priori on\nwhat the prediction should be conditioned (another benefit is that\nthe learned functions are a natural mechanism for explaining the\nprediction). As a result, our code completion system surpasses the\nprediction capabilities of existing, hard-wired systems.

\n", "tags": ["code generation", "grammar"], "tsne_embedding": [7.950240612030029, 8.844087600708008]}, {"key": "reid2022learning", "year": "2022", "title": "Learning to Model Editing Processes", "abstract": "

Most existing sequence generation models produce outputs in one pass, usually left-to-right. However, this is in contrast with a more natural approach that humans use in generating content; iterative refinement and editing. Recent work has introduced edit-based models for various tasks (such as neural machine translation and text style transfer), but these generally model a single edit step. In this work, we propose modeling editing processes, modeling the whole process of iteratively generating sequences. We form a conceptual framework to describe the likelihood of multi-step edits, and describe neural models that can learn a generative model of sequences based on these multistep edits. We introduce baseline results and metrics on this task, finding that modeling editing processes improves performance on a variety of axes on both our proposed task and related downstream tasks compared to previous single-step models of edits.

\n", "tags": ["Transformer", "edit"], "tsne_embedding": [-12.229606628417969, -0.6995916366577148]}, {"key": "ren2020codebleu", "year": "2020", "title": "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis", "abstract": "

Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they are not suitable enough to evaluate codes, because BLEU is originally designed to evaluate the natural language, neglecting important syntactic and semantic features of codes, and perfect accuracy is too strict thus it underestimates different outputs with the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by the programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU can achieve a better correlation with programmer assigned scores compared with BLEU and accuracy.

\n", "tags": ["evaluation"], "tsne_embedding": [6.542486190795898, -10.662931442260742]}, {"key": "richardson2017code2text", "year": "2017", "title": "The Code2Text Challenge: Text Generation in Source Code Libraries", "abstract": "

We propose a new shared task for tactical data-to-text generation in the domain of source code libraries. Specifically, we focus on text generation of function descriptions from example software projects. Data is drawn from existing resources used for studying the related problem of semantic parser induction (Richardson and Kuhn, 2017b; Richardson and Kuhn, 2017a), and spans a wide variety of both natural languages and programming languages. In this paper, we describe these existing resources, which will serve as training and development data for the task, and discuss plans for building new independent test sets.

\n", "tags": ["bimodal"], "tsne_embedding": [-8.73790454864502, -9.176773071289062]}, {"key": "richardson2017function", "year": "2017", "title": "Function Assistant: A Tool for NL Querying of APIs", "abstract": "

In this paper, we describe Function Assistant, a lightweight Python-based toolkit for querying and exploring source code repositories using natural language. The toolkit is designed to help end-users of a target API quickly find information about functions through high-level natural language queries and descriptions. For a given text query and background API, the tool finds candidate functions by performing a translation from the text to known representations in the API using the semantic parsing approach of Richardson and Kuhn (2017). Translations are automatically learned from example text-code pairs in example APIs. The toolkit includes features for building translation pipelines and query engines for arbitrary source code projects. To explore this last feature, we perform new experiments on 27 well-known Python projects hosted on Github.

\n", "tags": ["bimodal", "API"], "tsne_embedding": [-7.223910331726074, -11.76626968383789]}, {"key": "richardson2017learning", "year": "2017", "title": "Learning Technical Correspondences in Technical Documentation", "abstract": "

We consider the problem of translating high-level textual descriptions to formal representations in technical documentation as part of an effort to model the meaning of such documentation. We focus specifically on the problem of learning translational correspondences between text descriptions and grounded representations in the target documentation, such as formal representation of functions or code templates. Our approach exploits the parallel nature of such documentation, or the tight coupling between high-level text and the low-level representations we aim to learn. Data is collected by mining technical documents for such parallel text-representation pairs, which we use to train a simple semantic parsing model. We report new baseline results on sixteen novel datasets, including the standard library documentation for nine popular programming languages across seven natural languages, and a small collection of Unix utility manuals.

\n", "tags": ["documentation", "API", "bimodal"], "tsne_embedding": [-8.278753280639648, -11.140986442565918]}, {"key": "richardson2018polyglot", "year": "2018", "title": "Polyglot Semantic Parsing in APIs", "abstract": "

Traditional approaches to semantic parsing (SP) work by training individual models for each available parallel dataset of text-meaning pairs. In this paper, we explore the idea of polyglot semantic translation, or learning semantic parsing models that are trained on multiple datasets and natural languages. In particular, we focus on translating text to code signature representations using the software component datasets of Richardson and Kuhn (2017a,b). The advantage of such models is that they can be used for parsing a wide variety of input natural languages and output programming languages, or mixed input languages, using a single unified model. To facilitate modeling of this type, we develop a novel graph-based decoding framework that achieves state-of-the-art performance on the above datasets, and apply this method to two other benchmark SP tasks.

\n", "tags": ["bimodal", "API"], "tsne_embedding": [-22.73819923400879, -5.375976085662842]}, {"key": "richter2022can", "year": "2022", "title": "Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes", "abstract": "

Real bug fixes found in open source repositories seem to be the perfect source for learning to localize and repair real bugs. However, the absence of large scale bug fix collections has made it difficult to effectively exploit real bug fixes in the training of larger neural models in the past. In contrast, artificial bugs \u2013 produced by mutating existing source code \u2013 can be easily obtained at a sufficient scale and are therefore often preferred in the training of existing approaches. Still, localization and repair models that are trained on artificial bugs usually underperform when faced with real bugs. This raises the question whether bug localization and repair models trained on real bug fixes are more effective in localizing and repairing real bugs.

\n\n

We address this question by introducing RealiT, a pre-train-and-fine-tune approach for effectively learning to localize and repair real bugs from real bug fixes. RealiT is first pre-trained on a large number of artificial bugs produced by traditional mutation operators and then fine-tuned on a smaller set of real bug fixes. Fine-tuning does not require any modifications of the learning algorithm and hence can be easily adopted in various training scenarios for bug localization or repair (even when real training data is scarce). In addition, we found that training on real bug fixes with RealiT is empirically powerful by nearly doubling the localization performance of an existing model on real bugs while maintaining or even improving the repair performance.

\n", "tags": ["Transformer", "repair", "defect"], "tsne_embedding": [21.577245712280273, 3.6002347469329834]}, {"key": "roziere2021dobf", "year": "2021", "title": "DOBF: A Deobfuscation Pre-Training Objective for Programming Languages", "abstract": "

Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in unsupervised code translation, and 24% in natural language code search. Incidentally, we found that our pre-trained model is able to de-obfuscate fully obfuscated source files, and to suggest descriptive variable names.

\n", "tags": ["pretraining"], "tsne_embedding": [-2.3975305557250977, -4.464101791381836]}, {"key": "roziere2021leveraging", "year": "2021", "title": "Leveraging Automated Unit Tests for Unsupervised Code Translation", "abstract": "

With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java \u2192 Python and Python \u2192 C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.

\n", "tags": ["migration"], "tsne_embedding": [1.754235029220581, -20.610811233520508]}, {"key": "russell2018automated", "year": "2018", "title": "Automated Vulnerability Detection in Source Code Using Deep Representation Learning", "abstract": "

Increasing numbers of software vulnerabilities are discovered every year whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose serious risk of exploit and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully-selected findings from three different static analyzers that indicate potential exploits. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection.

\n", "tags": ["program analysis"], "tsne_embedding": [8.594146728515625, 18.35173797607422]}, {"key": "saberi2023model", "year": "2023", "title": "Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models", "abstract": "

Pre-trained Programming Language Models (PPLMs) achieved many recent states of the art results for many code-related software engineering tasks. Though some studies use data flow or propose tree-based models that utilize Abstract Syntax Tree (AST), most PPLMs do not fully utilize the rich syntactical information in source code. Still, the input is considered a sequence of tokens. There are two issues; the first is computational inefficiency due to the quadratic relationship between input length and attention complexity. Second, any syntactical information, when needed as an extra input to the current PPLMs, requires the model to be pre-trained from scratch, wasting all the computational resources already used for pre-training the current models. In this work, we propose Named Entity Recognition (NER) adapters, lightweight modules that can be inserted into Transformer blocks to learn type information extracted from the AST. These adapters can be used with current PPLMs such as CodeBERT, GraphCodeBERT, and CodeT5. We train the NER adapters using a novel Token Type Classification objective function (TTC). We insert our proposed work in CodeBERT, building CodeBERTER, and evaluate the performance on two tasks of code refinement and code summarization. CodeBERTER improves the accuracy of code refinement from 16.4 to 17.8 while using 20% of training parameter budget compared to the fully fine-tuning approach, and the BLEU score of code summarization from 14.75 to 15.90 while reducing 77% of training parameters compared to the fully fine-tuning approach.

\n", "tags": ["Transformer", "repair", "summarization"], "tsne_embedding": [-8.283045768737793, -5.088901519775391]}, {"key": "sahu2022learning", "year": "2022", "title": "Learning to Answer Semantic Queries over Code", "abstract": "

During software development, developers need answers to queries about semantic aspects of code. Even though extractive question-answering using neural approaches has been studied widely in natural languages, the problem of answering semantic queries over code using neural networks has not yet been explored. This is mainly because there is no existing dataset with extractive question and answer pairs over code involving complex concepts and long chains of reasoning. We bridge this gap by building a new, curated dataset called CodeQueries, and proposing a neural question-answering methodology over code.\nWe build upon state-of-the-art pre-trained models of code to predict answer and supporting-fact spans. Given a query and code, only some of the code may be relevant to answer the query. We first experiment under an ideal setting where only the relevant code is given to the model and show that our models do well. We then experiment under three pragmatic considerations: (1) scaling to large-size code, (2) learning from a limited number of examples and (3) robustness to minor syntax errors in code. Our results show that while a neural model can be resilient to minor syntax errors in code, increasing size of code, presence of code that is not relevant to the query, and reduced number of training examples limit the model performance. We are releasing our data and models to facilitate future work on the proposed problem of answering semantic queries over code.

\n", "tags": ["static analysis", "Transformer"], "tsne_embedding": [-3.9089300632476807, -12.759590148925781]}, {"key": "saini2018oreo", "year": "2018", "title": "Oreo: detection of clones in the twilight zone", "abstract": "

Source code clones are categorized into four types of increasing difficulty of detection, ranging from purely textual (Type-1) to purely semantic (Type-4). Most clone detectors reported in the literature work well up to Type-3, which accounts for syntactic differences. In between Type-3 and Type-4, however, there lies a spectrum of clones that, although still exhibiting some syntactic similarities, are extremely hard to detect \u2013 the Twilight Zone. Most clone detectors reported in the literature fail to operate in this zone. We present Oreo, a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately, but is also capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. We evaluate the recall of Oreo on BigCloneBench, and perform manual evaluation for precision. Oreo has both high recall and precision. More importantly, it pushes the boundary in detection of clones with moderate to weak syntactic similarity in a scalable manner.

\n", "tags": ["clone"], "tsne_embedding": [4.52032470703125, -7.576870441436768]}, {"key": "santos2018syntax", "year": "2018", "title": "Syntax and Sensibility: Using language models to detect and correct syntax errors", "abstract": "

Syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of experience that help them quickly resolve these frustrating errors. Standard LR parsers are of little help, typically resolving syntax errors and their precise location poorly. We propose a methodology that locates where syntax errors occur, and suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by using language models trained on correct source code to find tokens that seem out of place. Fixes are synthesized by consulting the language models to determine what tokens are more likely at the estimated error location. We compare n-gram and LSTM (long short-term memory) language models for this task, each trained on a large corpus of Java code collected from GitHub. Unlike prior work, our methodology does not rely that the problem source code comes from the same domain as the training data. We evaluated against a repository of real student mistakes. Our tools are able to find a syntactically-valid fix within its top-2 suggestions, often producing the exact fix that the student used to resolve the error. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated with incomprehensible syntax errors.

\n", "tags": ["repair", "language model"], "tsne_embedding": [17.201770782470703, -4.781204700469971]}, {"key": "saraiva2015products", "year": "2015", "title": "Products, Developers, and Milestones: How Should I Build My N-Gram Language Model", "abstract": "

Recent work has shown that although programming languages en-\nable source code to be rich and complex, most code tends to be\nrepetitive and predictable. The use of natural language processing\n(NLP) techniques applied to source code such as n-gram language\nmodels show great promise in areas such as code completion, aiding impaired developers, and code search. In this paper, we address\nthree questions related to different methods of constructing lan-\nguage models in an industrial context. Specifically, we ask: (1) Do\napplication specific, but smaller language models perform better\nthan language models across applications? (2) Are developer specific language models effective and do they differ depending on\nwhat parts of the codebase a developer is working in? (3) Finally,\ndo language models change over time, i.e., does a language model\nfrom early development model change later on in development?\nThe answers to these questions enable techniques that make use of\nprogramming language models in development to choose the model\ntraining corpus more effectively.

\n\n

We evaluate these questions by building 28 language models across\ndevelopers, time periods, and applications within Microsoft Office\nand present the results in this paper. We find that developer and\napplication specific language models perform better than models\nfrom the entire codebase, but that temporality has little to no effect\non language model performance.

\n", "tags": ["language model"], "tsne_embedding": [-3.7964048385620117, 2.164177417755127]}, {"key": "sarkar2022what", "year": "2022", "title": "What is it like to program with artificial intelligence?", "abstract": "

Large language models, such as OpenAI\u2019s codex and Deepmind\u2019s AlphaCode, can generate code to solve a variety of problems expressed in natural language. This technology has already been commercialised in at least one widely-used programming editor extension: GitHub Copilot.

\n\n

In this paper, we explore how programming with large language models (LLM-assisted programming) is similar to, and differs from, prior conceptualisations of programmer assistance. We draw upon publicly available experience reports of LLM-assisted programming, as well as prior usability and design studies. We find that while LLM-assisted programming shares some properties of compilation, pair programming, and programming via search and reuse, there are fundamental differences both in the technical possibilities as well as the practical experience. Thus, LLM-assisted programming ought to be viewed as a new way of programming with its own distinct properties and challenges.

\n\n

Finally, we draw upon observations from a user study in which non-expert end user programmers use LLM-assisted tools for solving data tasks in spreadsheets. We discuss the issues that might arise, and open research challenges, in applying large language models to end-user programming, particularly with users who have little or no programming expertise.

\n", "tags": ["human evaluation", "review"], "tsne_embedding": [9.448033332824707, -1.693267822265625]}, {"key": "schrouff2019inferring", "year": "2019", "title": "Inferring Javascript types using Graph Neural Networks", "abstract": "

The recent use of `Big Code\u2019 with state-of-the-art deep learning methods offers promising avenues to ease program source code writing and correction. As a first step towards automatic code repair, we implemented a graph neural network model that predicts token types for Javascript programs. The predictions achieve an accuracy above 90%, which improves on previous similar work.

\n", "tags": ["GNN", "types", "program analysis"], "tsne_embedding": [-2.5818190574645996, 23.24529457092285]}, {"key": "schuster2021you", "year": "2021", "title": "You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion", "abstract": "

Code autocompletion is an integral feature of modern code editors and IDEs. The latest generation of autocompleters uses neural language models, trained on public open-source code repositories, to suggest likely (not just statically feasible) completions given the current context.

\n\n

We demonstrate that neural code autocompleters are vulnerable to poisoning attacks. By adding a few specially-crafted files to the autocompleter\u2019s training corpus (data poisoning), or else by directly fine-tuning the autocompleter on these files (model poisoning), the attacker can influence its suggestions for attacker-chosen contexts. For example, the attacker can \u201cteach\u201d the autocompleter to suggest the insecure ECB mode for AES encryption, SSLv3 for the SSL/TLS protocol version, or a low iteration count for password-based encryption. Moreover, we show that these attacks can be targeted: an autocompleter poisoned by a targeted attack is much more likely to suggest the insecure completion for files from a specific repo or specific developer.

\n\n

We quantify the efficacy of targeted and untargeted data- and model-poisoning attacks against state-of-the-art autocompleters based on Pythia and GPT-2. We then evaluate existing defenses against poisoning attacks and show that they are largely ineffective.

\n", "tags": ["autocomplete", "adversarial"], "tsne_embedding": [11.258028984069824, 19.836299896240234]}, {"key": "sharma2015nirmal", "year": "2015", "title": "NIRMAL: Automatic Identification of Software Relevant Tweets Leveraging Language Model", "abstract": "

Twitter is one of the most widely used social media\nplatforms today. It enables users to share and view short 140-character messages called \u201ctweets\u201d. About 284 million active\nusers generate close to 500 million tweets per day. Such rapid\ngeneration of user generated content in large magnitudes results\nin the problem of information overload. Users who are interested\nin information related to a particular domain have limited means\nto filter out irrelevant tweets and tend to get lost in the huge\namount of data they encounter. A recent study by Singer et\nal. found that software developers use Twitter to stay aware of\nindustry trends, to learn from others, and to network with other\ndevelopers. However, Singer et al. also reported that developers\noften find Twitter streams to contain too much noise which is a\nbarrier to the adoption of Twitter. In this paper, to help developers\ncope with noise, we propose a novel approach named NIRMAL,\nwhich automatically identifies software relevant tweets from a\ncollection or stream of tweets. Our approach is based on language\nmodeling which learns a statistical model based on a training\ncorpus (i.e., set of documents). We make use of a subset of posts\nfrom StackOverflow, a programming question and answer site, as\na training corpus to learn a language model. A corpus of tweets\nwas then used to test the effectiveness of the trained language\nmodel. The tweets were sorted based on the rank the model\nassigned to each of the individual tweets. The top 200 tweets\nwere then manually analyzed to verify whether they are software\nrelated or not, and then an accuracy score was calculated. The\nresults show that decent accuracy scores can be achieved by\nvarious variants of NIRMAL, which indicates that NIRMAL can\neffectively identify software related tweets from a huge corpus of\ntweets.

\n", "tags": ["information extraction"], "tsne_embedding": [-4.293123245239258, -23.591917037963867]}, {"key": "sharma2019feasibility", "year": "2019", "title": "On the Feasibility of Transfer-learning Code Smells using Deep Learning", "abstract": "

Context: A substantial amount of work has been done to detect smells in source code using metrics-based and heuristics-based methods. Machine learning methods have been recently applied to detect source code smells; however, the current practices are considered far from mature.

\n\n

Objective: First, explore the feasibility of applying deep learning models to detect smells without extensive feature engineering, just by feeding the source code in tokenized form. Second, investigate the possibility of applying transfer-learning in the context of deep learning models for smell detection.

\n\n

Method: We use existing metric-based state-of-the-art methods for detecting three implementation smells and one design smell in C# code. Using these results as the annotated gold standard, we train smell detection models on three different deep learning architectures. These architectures use Convolution Neural Networks (CNNs) of one or two dimensions, or Recurrent Neural Networks (RNNs) as their principal hidden layers. For the first objective of our study, we perform training and evaluation on C# samples, whereas for the second objective, we train the models from C# code and evaluate the models over Java code samples. We perform the experiments with various combinations of hyper-parameters for each model.

\n\n

Results: We find it feasible to detect smells using deep learning methods. Our comparative experiments find that there is no clearly superior method between CNN-1D and CNN-2D. We also observe that performance of the deep learning models is smell-specific. Our transfer-learning experiments show that transfer-learning is definitely feasible for implementation smells with performance comparable to that of direct-learning. This work opens up a new paradigm to detect code smells by transfer-learning especially for the programming languages where the comprehensive code smell detection tools are not available.

\n", "tags": ["representation", "program analysis"], "tsne_embedding": [0.5098184943199158, 20.38808250427246]}, {"key": "sharma2022exploratory", "year": "2022", "title": "An Exploratory Study on Code Attention in BERT", "abstract": "

Many recent models in software engineering introduced deep neural models based on the Transformer architecture or use transformer-based Pre-trained Language Models (PLM) trained on code. Although these models achieve the state of the arts results in many downstream tasks such as code summarization and bug detection, they are based on Transformer and PLM, which are mainly studied in the Natural Language Processing (NLP) field. The current studies rely on the reasoning and practices from NLP for these models in code, despite the differences between natural languages and programming languages. There is also limited literature on explaining how code is modeled. Here, we investigate the attention behavior of PLM on code and compare it with natural language. We pre-trained BERT, a Transformer based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We run several experiments to analyze the attention values of code constructs on each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the most attended token [CLS] in NLP. This observation motivated us to leverage identifiers to represent the code sequence instead of the [CLS] token when used for code clone detection. Our results show that employing embeddings from identifiers increases the performance of BERT by 605% and 4% F1-score in its lower layers and the upper layers, respectively. When identifiers\u2019 embeddings are used in CodeBERT, a code-based PLM, the performance is improved by 21\u201324% in the F1-score of clone detection. The findings can benefit the research community by using code-specific representations instead of applying the common embeddings used in NLP, and open new directions for developing smaller models with similar performance.

\n\n", "tags": ["Transformer", "representation", "language model", "interpretability", "pretraining", "clone"], "tsne_embedding": [-5.47562837600708, -4.667429447174072]}, {"key": "sharma2022lamner", "year": "2022", "title": "LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition", "abstract": "

Code comment generation is the task of generating a high-level natural language description for a given code method/function. Although researchers have been studying multiple ways to generate code comments automatically, previous work mainly considers representing a code token in its entirety semantics form only (e.g., a language model is used to learn the semantics of a code token), and additional code properties such as the tree structure of a code are included as an auxiliary input to the model. There are two limitations: 1) Learning the code token in its entirety form may not be able to capture information succinctly in source code, and 2)The code token does not contain additional syntactic information, inherently important in programming languages. In this paper, we present LAnguage Model and Named Entity Recognition (LAMNER), a code comment generator capable of encoding code constructs effectively and capturing the structural property of a code token. A character-level language model is used to learn the semantic representation to encode a code token. For the structural property of a token, a Named Entity Recognition model is trained to learn the different types of code tokens. These representations are then fed into an encoder-decoder architecture to generate code comments. We evaluate the generated comments from LAMNER and other baselines on a popular Java dataset with four commonly used metrics. Our results show that LAMNER is effective and improves over the best baseline model in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr by 14.34%, 18.98%, 21.55%, 23.00%, 10.52%, 1.44%, and 25.86%, respectively. Additionally, we fused LAMNER\u2019s code representation with the baseline models, and the fused models consistently showed improvement over the nonfused models. The human evaluation further shows that LAMNER produces high-quality code comments.

\n\n", "tags": ["summarization", "documentation", "language model", "types", "representation"], "tsne_embedding": [-8.299118995666504, -5.583765983581543]}, {"key": "she2019neuzz", "year": "2019", "title": "NEUZZ: Efficient Fuzzing with Neural Program Smoothing", "abstract": "

Fuzzing has become the de facto standard technique for finding software vulnerabilities. However, even state-of-the-art fuzzers are not very efficient at finding hard-to-trigger software bugs. Most popular fuzzers use evolutionary guidance to generate inputs that can trigger different bugs. Such evolutionary algorithms, while fast and simple to implement, often get stuck in fruitless sequences of random mutations. Gradient-guided optimization presents a promising alternative to evolutionary guidance. Gradient-guided techniques have been shown to significantly outperform evolutionary algorithms at solving high-dimensional structured optimization problems in domains like machine learning by efficiently utilizing gradients or higher-order derivatives of the underlying function. However, gradient-guided approaches are not directly applicable to fuzzing as real-world program behaviors contain many discontinuities, plateaus, and ridges where the gradient-based methods often get stuck. We observe that this problem can be addressed by creating a smooth surrogate function approximating the discrete branching behavior of target program. In this paper, we propose a novel program smoothing technique using surrogate neural network models that can incrementally learn smooth approximations of a complex, real-world program\u2019s branching behaviors. We further demonstrate that such neural network models can be used together with gradient-guided input generation schemes to significantly improve the fuzzing efficiency. Our extensive evaluations demonstrate that NEUZZ significantly outperforms 10 state-of-the-art graybox fuzzers on 10 real-world programs both at finding new bugs and achieving higher edge coverage. NEUZZ found 31 unknown bugs that other fuzzers failed to find in 10 real world programs and achieved 3X more edge coverage than all of the tested graybox fuzzers for 24 hours running.

\n", "tags": ["fuzzing"], "tsne_embedding": [16.835744857788086, 13.821191787719727]}, {"key": "shi2019learning", "year": "2019", "title": "Learning Execution through Neural Code Fusion", "abstract": "

As the performance of computer systems stagnates due to the end of Moore\u2019s Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn representations of source code, these representations do not understand how code dynamically executes. In this work, we propose a new approach to use GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and complex data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related task (algorithm classification).

\n", "tags": ["representation"], "tsne_embedding": [1.513840675354004, 16.364295959472656]}, {"key": "shi2022cv4code", "year": "2022", "title": "CV4Code: Sourcecode Understanding via Visual Code Representations", "abstract": "

We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that facilitates fast generation of sourcecode images and eliminates redundancy in the encoding that would arise from an RGB pixel representation. Furthermore, as sourcecode is treated as images, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to any particular programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code which is not possible from methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by learning Convolutional and Transformer networks to predict the functional task, i.e. the problem it solves, of the source code directly from its two-dimensional representation, and using an embedding from its latent space to derive a similarity score of two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time we show the benefits of treating sourcecode understanding as a form of image processing task.

\n", "tags": ["code similarity", "Transformer"], "tsne_embedding": [-2.9902937412261963, 18.02239990234375]}, {"key": "shido2019automatic", "year": "2019", "title": "Automatic Source Code Summarization with Extended Tree-LSTM", "abstract": "

Neural machine translation models are used to automatically generate a document from given source code since this can be regarded as a machine translation task. Source code summarization is one of the components for automatic document generation, which generates a summary in natural language from given source code. This suggests that techniques used in neural machine translation, such as Long Short-Term Memory (LSTM), can be used for source code summarization. However, there is a considerable difference between source code and natural language: Source code is essentially structured, having loops and conditional branching, etc. Therefore, there is some obstacle to apply known machine translation models to source code.Abstract syntax trees (ASTs) capture these structural properties and play an important role in recent machine learning studies on source code. Tree-LSTM is proposed as a generalization of LSTMs for tree-structured data. However, there is a critical issue when applying it to ASTs: It cannot handle a tree that contains nodes having an arbitrary number of children and their order simultaneously, which ASTs generally have such nodes. To address this issue, we propose an extension of Tree-LSTM, which we call Multi-way Tree-LSTM and apply it for source code summarization. As a result of computational experiments, our proposal achieved better results when compared with several state-of-the-art techniques.

\n", "tags": ["summarization", "grammar"], "tsne_embedding": [-14.440337181091309, -6.6632256507873535]}, {"key": "shirani2018evaluation", "year": "2018", "title": "Evaluation of Type Inference with Textual Cues", "abstract": "

Type information plays an important role in the success of information retrieval and recommendation systems in software\nengineering. Thus, the absence of types in dynamically-typed\nlanguages poses a challenge to adapt these systems to support\ndynamic languages.

\n\n

In this paper, we explore the viability of type inference using\ntextual cues. That is, we formulate the type inference problem as a classification problem which uses the textual features\nin the source code to predict the type of variables. In this\napproach, a classifier learns a model to distinguish between\ntypes of variables in a program. The model is subsequently\nused to (approximately) infer the types of other variables.

\n\n

We evaluate the feasibility of this approach on four Java\nprojects wherein type information is already available in the\nsource code and can be used to train and test a classifier. Our\nexperiments show this approach can predict the type of new\nvariables with relatively high accuracy (80% F-measure).\nThese results suggest that textual cues can be\ncomplementary\ntools in inferring types for dynamic languages.

\n", "tags": ["information extraction"], "tsne_embedding": [-1.2419489622116089, 29.08860206604004]}, {"key": "shrivastava2020on-the-fly", "year": "2020", "title": "On-the-Fly Adaptation of Source Code Models using Meta-Learning", "abstract": "

The ability to adapt to unseen, local contexts is an important challenge that successful models of source code must overcome. One of the most popular approaches for the adaptation of such models is dynamic evaluation. With dynamic evaluation, when running a model on an unseen file, the model is updated immediately after having observed each token in that file. In this work, we propose instead to frame the problem of context adaptation as a meta-learning problem. We aim to train a base source code model that is best able to learn from information in a file to deliver improved predictions of missing tokens. Unlike dynamic evaluation, this formulation allows us to select more targeted information (support tokens) for adaptation, that is both before and after a target hole in a file. We consider an evaluation setting that we call line-level maintenance, designed to reflect the downstream task of code auto-completion in an IDE. Leveraging recent developments in meta-learning such as first-order MAML and Reptile, we demonstrate improved performance in experiments on a large scale Java GitHub corpus, compared to other adaptation baselines including dynamic evaluation. Moreover, our analysis shows that, compared to a non-adaptive baseline, our approach improves performance on identifiers and literals by 44% and 15%, respectively.

\n", "tags": ["language model", "autocomplete"], "tsne_embedding": [0.3279370963573456, -3.7980923652648926]}, {"key": "shrivastava2020repository", "year": "2022", "title": "Repository-Level Prompt Generation for Large Language Models of Code", "abstract": "

With the success of large language models (LLMs) of code and their use as code assistants (e.g. Codex used in GitHub Copilot), techniques for introducing domain-specific knowledge in the prompt design process become important. In this work, we propose a framework called Repo-Level Prompt Generator that learns to generate example-specific prompts using a set of rules. These rules take context from the entire repository, thereby incorporating both the structure of the repository and the context from other relevant files (e.g. imports, parent class files). Our technique doesn\u2019t require any access to the weights of the LLM, making it applicable in cases where we only have black-box access to the LLM. We conduct experiments on the task of single-line code-autocompletion using code repositories taken from Google Code archives. We demonstrate that an oracle constructed from our proposed rules gives up to 36% relative improvement over Codex, showing the quality of the rules. Further, we show that when we train a model to select the best rule, we can achieve significant performance gains over Codex. The code for our work can be found at: https://github.com/shrivastavadisha/repo_level_prompt_generation .

\n", "tags": ["Transformer", "code completion"], "tsne_embedding": [2.3373360633850098, -3.2808141708374023]}, {"key": "shrivastava2023repofusion", "year": "2023", "title": "RepoFusion: Training Code Models to Understand Your Repository", "abstract": "

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models as CodeGen-16B-multi ($\\sim73\\times$ larger) and closely match the performance of the $\\sim 70\\times$ larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we are making available the code and trained checkpoints for our work. Our released resources can be found at \\url{https://huggingface.co/RepoFusion}.

\n", "tags": ["completion"], "tsne_embedding": [2.1906557083129883, -3.663079023361206]}, {"key": "shuai2020improving", "year": "2020", "title": "Improving Code Search with Co-Attentive Representation Learning", "abstract": "

Searching and reusing existing code from a large-scale codebase, e.g, GitHub, can help developers complete a programming task efficiently. Recently, Gu et al. proposed a deep learning-based model (i.e., DeepCS), which significantly outperformed prior models. The DeepCS embedded codebase and natural language queries into vectors by two LSTM (long and short-term memory) models separately, and returned developers the code with higher similarity to a code search query. However, such embedding method learned two isolated representations for code and query but ignored their internal semantic correlations. As a result, the learned isolated representations of code and query may limit the effectiveness of code search.

\n\n

To address the aforementioned issue, we propose a co-attentive representation learning model, i.e., Co-Attentive Representation Learning Code Search-CNN (CARLCS-CNN). CARLCS-CNN learns interdependent representations for the embedded code and query with a co-attention mechanism. Generally, such mechanism learns a correlation matrix between embedded code and query, and co-attends their semantic relationship via row/column-wise max-pooling. In this way, the semantic correlation between code and query can directly affect their individual representations. We evaluate the effectiveness of CARLCS-CNN on Gu et al.\u2019s dataset with 10k queries. Experimental results show that the proposed CARLCS-CNN model significantly outperforms DeepCS by 26.72% in terms of MRR (mean reciprocal rank). Additionally, CARLCS-CNN is five times faster than DeepCS in model training and four times in testing.

\n", "tags": ["search"], "tsne_embedding": [-0.7768654227256775, -15.887958526611328]}, {"key": "si2018learning", "year": "2018", "title": "Learning Loop Invariants for Program Verification", "abstract": "

A fundamental problem in program verification concerns inferring loop invariants.\nThe problem is undecidable and even practical instances are challenging. Inspired\nby how human experts construct loop invariants, we propose a reasoning framework\nCODE2INV\nthat constructs the solution by multi-step decision making and querying\nan external program graph memory block. By training with reinforcement learning,\nCODE2INV\ncaptures rich program features and avoids the need for ground truth\nsolutions as supervision. Compared to previous learning tasks in domains with\ngraph-structured data, it addresses unique challenges, such as a binary objective\nfunction and an extremely sparse reward that is given by an automated theorem\nprover only after the complete loop invariant is proposed. We evaluate\nCODE2INV on\na suite of 133 benchmark problems and compare it to three state-of-the-art systems.\nIt solves 106 problems compared to 73 by a stochastic search-based system, 77 by\na heuristic search-based system, and 100 by a decision tree learning-based system.\nMoreover, the strategy learned can be generalized to new programs: compared to\nsolving new instances from scratch, the pre-trained agent is more sample efficient\nin finding solutions.

\n", "tags": ["program analysis", "verification"], "tsne_embedding": [8.09742546081543, 10.504767417907715]}, {"key": "silavong2022senatus", "year": "2022", "title": "Senatus - A Fast and Accurate Code-to-Code Recommendation Engine", "abstract": "

Machine learning on source code (MLOnCode) is a popular research field that has been driven by the availability of large-scale code repositories and the development of powerful probabilistic and deep learning models for mining source code. Code-to-code recommendation is a task in MLOnCode that aims to recommend relevant, diverse and concise code snippets that usefully extend the code currently being written by a developer in their development environment (IDE). Code-to-code recommendation engines hold the promise of increasing developer productivity by reducing context switching from the IDE and increasing code-reuse. Existing code-to-code recommendation engines do not scale gracefully to large codebases, exhibiting a linear growth in query time as the code repository increases in size. In addition, existing code-to-code recommendation engines fail to account for the global statistics of code repositories in the ranking function, such as the distribution of code snippet lengths, leading to sub-optimal retrieval results. We address both of these weaknesses with Senatus, a new code-to-code recommendation engine. At the core of Senatus is De-Skew LSH a new locality sensitive hashing (LSH) algorithm that indexes the data for fast (sub-linear time) retrieval while also counteracting the skewness in the snippet length distribution using novel abstract syntax tree-based feature scoring and selection algorithms. We evaluate Senatus and find the recommendations to be of higher quality than competing baselines, while achieving faster search. For example on the CodeSearchNet dataset Senatus improves performance by 31.21% F1 and 147.9x faster query time compared to Facebook Aroma. Senatus also outperforms standard MinHash LSH by 29.2% F1 and 51.02x faster query time.

\n", "tags": ["code similarity", "search"], "tsne_embedding": [-5.0657639503479, -17.909690856933594]}, {"key": "silva2023repairllama", "year": "2023", "title": "RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair", "abstract": "

Automated Program Repair (APR) has evolved significantly with the advent of Large Language Models (LLMs). Fine-tuning LLMs for program repair is a recent avenue of research, with many dimensions which have not been explored. Existing work mostly fine-tunes LLMs with naive code representations and is fundamentally limited in its ability to fine-tune larger LLMs. To address this problem, we propose RepairLLaMA, a novel program repair approach that combines 1) code representations for APR and 2) the state-of-the-art parameter-efficient LLM fine-tuning technique called LoRA. This results in RepairLLaMA producing a highly effective `program repair adapter\u2019 for fixing bugs with language models. Our experiments demonstrate the validity of both concepts. First, fine-tuning adapters with program repair specific code representations enables the model to use meaningful repair signals. Second, parameter-efficient fine-tuning helps fine-tuning to converge and contributes to the effectiveness of the repair adapter to fix data-points outside the fine-tuning data distribution. Overall, RepairLLaMA correctly fixes 125 Defects4J v2 and 82 HumanEval-Java bugs, outperforming all baselines.

\n", "tags": ["repair"], "tsne_embedding": [21.054920196533203, 1.3900023698806763]}, {"key": "singh2016question", "year": "2016", "title": "Question Independent Grading using Machine Learning: The Case of Computer Program Grading", "abstract": "

Learning supervised models to grade open-ended responses is an expensive process. A model has to be trained for every prompt/question separately, which in turn requires graded samples. In automatic programming evaluation specifically, the focus of this work, this issue is amplified. The models have to be trained not only for every question but also for every language the question is offered in. Moreover, the availability and time taken by experts to create a labeled set of programs for each question is a major bottleneck in scaling such a system. We address this issue by presenting a method to grade computer programs which requires no manually assigned labeled samples for grading responses to a new, unseen question. We extend our previous work (by Srikant, Aggarwal; KDD 2014) wherein we introduced a grammar of features to learn question specific models. In this work, we propose a method to transform those features into a set of features that maintain their structural relation with the labels across questions. Using these features we learn one supervised model, across questions for a given language, which can then be applied to an ungraded response to an unseen question. We show that our method rivals the performance of both, question specific models and the consensus among human experts while substantially outperforming extant ways of evaluating codes. We demonstrate the system single s value by deploying it to grade programs in a high stakes assessment. The learning from this work is transferable to other grading tasks such as math question grading and also provides a new variation to the supervised learning approach.

\n", "tags": ["education"], "tsne_embedding": [-12.69198989868164, 17.44247817993164]}, {"key": "siow2019core", "year": "2019", "title": "CORE: Automating Review Recommendation for Code Changes", "abstract": "

Code review is a common process that is used by developers, in which a reviewer provides useful comments or points out defects in the submitted source code changes via pull request. Code review has been widely used for both industry and open-source projects due to its capacity in early defect identification, project maintenance, and code improvement. With rapid updates on project developments, code review becomes a non-trivial and labor-intensive task for reviewers. Thus, an automated code review engine can be beneficial and useful for project development in practice. Although there exist prior studies on automating the code review process by adopting static analysis tools or deep learning techniques, they often require external sources such as partial or full source code for accurate review suggestion. In this paper, we aim at automating the code review process only based on code changes and the corresponding reviews but with better performance. The hinge of accurate code review suggestion is to learn good representations for both code changes and reviews. To achieve this with limited source, we design a multi-level embedding (i.e., word embedding and character embedding) approach to represent the semantics provided by code changes and reviews. The embeddings are then well trained through a proposed attentional deep learning model, as a whole named CORE. We evaluate the effectiveness of CORE on code changes and reviews collected from 19 popular Java projects hosted on Github. Experimental results show that our model CORE can achieve significantly better performance than the state-of-the-art model (DeepMem), with an increase of 131.03% in terms of Recall@10 and 150.69% in terms of Mean Reciprocal Rank. Qualitative general word analysis among project developers also demonstrates the performance of CORE in automating code review.

\n", "tags": ["review"], "tsne_embedding": [-8.25676441192627, 2.0948007106781006]}, {"key": "siow2022learning", "year": "2022", "title": "Learning Program Semantics with Code Representations: An Empirical Study", "abstract": "

Program semantics learning is the core and fundamental for various code intelligent tasks e.g., vulnerability detection, clone detection. A considerable amount of existing works propose diverse approaches to learn the program semantics for different tasks and these works have achieved state-of-the-art performance. However, currently, a comprehensive and systematic study on evaluating different program representation techniques across diverse tasks is still missed.

\n\n

From this starting point, in this paper, we conduct an empirical study to evaluate different program representation techniques. Specifically, we categorize current mainstream code representation techniques into four categories i.e., Feature-based, Sequence-based, Tree-based, and Graph-based program representation technique and evaluate its performance on three diverse and popular code intelligent tasks i.e., {Code Classification}, Vulnerability Detection, and Clone Detection on the public released benchmark. We further design three {research questions (RQs)} and conduct a comprehensive analysis to investigate the performance. By the extensive experimental results, we conclude that (1) The graph-based representation is superior to the other selected techniques across these tasks. (2) Compared with the node type information used in tree-based and graph-based representations, the node textual information is more critical to learning the program semantics. (3) Different tasks require the task-specific semantics to achieve their highest performance, however combining various program semantics from different dimensions such as control dependency, data dependency can still produce promising results.

\n", "tags": ["representation"], "tsne_embedding": [0.8176686763763428, 9.674466133117676]}, {"key": "sivaraman2021mining", "year": "2021", "title": "Mining Idioms in the Wild", "abstract": "

Existing code repositories contain numerous instances of code patterns that are idiomatic ways of accomplishing a particular programming task. Sometimes, the programming language in use supports specific operators or APIs that can express the same idiomatic imperative code much more succinctly. However, those code patterns linger in repositories because the developers may be unaware of the new APIs or have not gotten around to them. Detection of idiomatic code can also point to the need for new APIs.

\n\n

We share our experiences in mine idiomatic patterns from the Hack repo at Facebook. We found that existing techniques either cannot identify meaningful patterns from syntax trees or require test-suite-based dynamic analysis to incorporate semantic properties to mine useful patterns. The key insight of the approach proposed in this paper \u2013 Jezero \u2013 is that semantic idioms from a large codebase can be learned from canonicalized dataflow trees. We propose a scalable, lightweight static analysis-based approach to construct such a tree that is well suited to mine semantic idioms using nonparametric Bayesian methods.

\n\n

Our experiments with Jezero on Hack code shows a clear advantage of adding canonicalized dataflow information to ASTs: Jezero was significantly more effective than a baseline that did not have the dataflow augmentation in being able to effectively find refactoring opportunities from unannotated legacy code.

\n", "tags": ["pattern mining", "refactoring"], "tsne_embedding": [11.446783065795898, -13.309419631958008]}, {"key": "souza2023lexecutor", "year": "2023", "title": "LExecutor: Learning-Guided Execution", "abstract": "

Executing code is essential for various program analysis tasks, e.g., to detect bugs that manifest through exceptions or to obtain execution traces for further dynamic analysis. However, executing an arbitrary piece of code is often difficult in practice, e.g., because of missing variable definitions, missing user inputs, and missing third-party dependencies. This paper presents LExecutor, a learning-guided approach for executing arbitrary code snippets in an underconstrained way. The key idea is to let a neural model predict missing values that otherwise would cause the program to get stuck, and to inject these values into the execution. For example, LExecutor injects likely values for otherwise undefined variables and likely return values of calls to otherwise missing functions. We evaluate the approach on Python code from popular open-source projects and on code snippets extracted from Stack Overflow. The neural model predicts realistic values with an accuracy between 80.1% and 94.2%, allowing LExecutor to closely mimic real executions. As a result, the approach successfully executes significantly more code than any available technique, such as simply executing the code as-is. For example, executing the open-source code snippets as-is covers only 4.1% of all lines, because the code crashes early on, whereas LExecutor achieves a coverage of 50.1%.

\n\n", "tags": ["execution"], "tsne_embedding": [10.042441368103027, 9.588167190551758]}, {"key": "spirin2021psiminer", "year": "2021", "title": "PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code", "abstract": "

The application of machine learning algorithms to source code has grown in the past years. Since these algorithms are quite sensitive to input data, it is not surprising that researchers experiment with input representations. Nowadays, a popular starting point to represent code is abstract syntax trees (ASTs). Abstract syntax trees have been used for a long time in various software engineering domains, and in particular in IDEs. The API of modern IDEs allows to manipulate and traverse ASTs, resolve references between code elements, etc. Such algorithms can enrich ASTs with new data and therefore may be useful in ML-based code analysis. In this work, we present PSIMINER\u2014 a tool for processing PSI trees from the IntelliJ Platform. PSI trees contain code syntax trees as well as functions to work with them, and therefore can be used to enrich code representation using static analysis algorithms of modern IDEs. To showcase this idea, we use our tool to infer types of identifiers in Java ASTs and extend the code2seq model for the method name prediction problem.

\n", "tags": ["tool"], "tsne_embedding": [9.013425827026367, -14.487183570861816]}, {"key": "srikant2014system", "year": "2014", "title": "A system to grade computer programming skills using machine learning", "abstract": "

The automatic evaluation of computer programs is a nascent area of research with a potential for large-scale impact. Extant program assessment systems score mostly based on the number of test-cases passed, providing no insight into the competency of the programmer. In this paper, we present a system to grade computer programs automatically. In addition to grading a program on its programming practices and complexity, the key kernel of the system is a machine-learning based algorithm which determines closeness of the logic of the given program to a correct program. This algorithm uses a set of highly-informative features, derived from the abstract representations of a given program, that capture the program\u2019s functionality. These features are then used to learn a model to grade the programs, which are built against evaluations done by experts. We show that the regression models provide much better grading than the ubiquitous test-case-pass based grading and rivals the grading accuracy of other open-response problems such as essay grading . We also show that our novel features add significant value over and above basic keyword/expression count features. In addition to this, we propose a novel way of posing computer-program grading as a one-class modeling problem and report encouraging preliminary results. We show the value of the system through a case study in a real-world industrial deployment. To the best of the authors\u2019 knowledge, this is the first time a system using machine learning has been developed and used for grading programs. The work is timely with regard to the recent boom in Massively Online Open Courseware (MOOCs), which promises to produce a significant amount of hand-graded digitized data.

\n", "tags": ["education"], "tsne_embedding": [-12.453431129455566, 17.697343826293945]}, {"key": "sun2019grammar", "year": "2019", "title": "A Grammar-Based Structural CNN Decoder for Code Generation", "abstract": "

Code generation maps a program description to executable\nsource code in a programming language. Existing approaches\nmainly rely on a recurrent neural network (RNN) as the decoder. However, we find that a program contains significantly\nmore tokens than a natural language sentence, and thus it may\nbe inappropriate for RNN to capture such a long sequence. In\nthis paper, we propose a grammar-based structural convolutional neural network (CNN) for code generation. Our model\ngenerates a program by predicting the grammar rules of the\nprogramming language; we design several CNN modules, including the tree-based convolution and pre-order convolution,\nwhose information is further aggregated by dedicated attentive pooling layers. Experimental results on the HearthStone\nbenchmark dataset show that our CNN code generator significantly outperforms the previous state-of-the-art method by 5\npercentage points; additional experiments on several semantic parsing tasks demonstrate the robustness of our model. We\nalso conduct in-depth ablation test to better understand each\ncomponent of our model.

\n", "tags": ["code generation", "grammar"], "tsne_embedding": [-11.033442497253418, 4.547600269317627]}, {"key": "sun2020pscs", "year": "2020", "title": "PSCS: A Path-based Neural Model for Semantic Code Search", "abstract": "

To obtain code snippets for reuse, programmers prefer to search for related documents, e.g., blogs or Q&A, instead of code itself. The major reason is due to the semantic diversity and mismatch between queries and code snippets. Deep learning models have been proposed to address this challenge. Compared with approaches using information retrieval techniques, deep learning models do not suffer from the information loss caused by refining user intention into keywords. However, the performance of previous works is not satisfactory because they ignore the importance of code structure. When the semantics of code (e.g., identifier names, APIs) are ambiguous, code structure may be the only feature for the model to utilize. In that case, previous works relearn the structural information from lexical tokens of code, which is extremely difficult for a model without any domain knowledge. In this work, we propose PSCS, a path-based neural model for semantic code search. Our model encodes both the semantics and structures of code represented by AST paths. We train and evaluate our model over 330k-19k query-function pairs, respectively. The evaluation results demonstrate that PSCS achieves a SuccessRate of 47.6% and a Mean Reciprocal Rank (MRR) of 30.4% when considering the top-10 results with a match. The proposed approach significantly outperforms both DeepCS, the first approach that applies deep learning to code search task, and CARLCS, a state-of-the-art approach that introduces a co-attentive representation learning model on the basis of DeepCS. The importance of code structure is demonstrated with an ablation study on code features, which enlightens model design for further studies.

\n", "tags": ["grammar", "search"], "tsne_embedding": [-0.6917613744735718, -16.10127067565918]}, {"key": "svyatkovskiy2019pythia", "year": "2019", "title": "Pythia: AI-assisted Code Completion System", "abstract": "

In this paper, we propose a novel end-to-end approach for AI-assisted code completion called Pythia. It generates ranked lists of method and API recommendations which can be used by software developers at edit time. The system is currently deployed as part of Intellicode extension in Visual Studio Code IDE. Pythia exploits state-of-the-art large-scale deep learning models trained on code contexts extracted from abstract syntax trees. It is designed to work at a high throughput predicting the best matching code completions on the order of 100 ms.

\n\n

We describe the architecture of the system, perform comparisons to frequency-based approach and invocation-based Markov Chain language model, and discuss challenges serving Pythia models on lightweight client devices.

\n\n

The offline evaluation results obtained on 2700 Python open source software GitHub repositories show a top-5 accuracy of 92%, surpassing the baseline models by 20% averaged over classes, for both intra and cross-project settings.

\n\n", "tags": ["autocomplete", "language model"], "tsne_embedding": [-5.5462236404418945, 5.799919128417969]}, {"key": "svyatkovskiy2020fast", "year": "2020", "title": "Fast and Memory-Efficient Neural Code Completion", "abstract": "

Code completion is one of the most widely used features of modern integrated development environments (IDEs). Deep learning has recently made significant progress in the statistical prediction of source code. However, state-of-the-art neural network models consume prohibitively large amounts of memory, causing computational burden to the development environment, especially when deployed in lightweight client devices.

\n\n

In this work, we reframe neural code completion from a generation task to a task of learning to rank the valid completion suggestions computed from static analyses. By doing so, we are able to design and test a variety of deep neural network model configurations. One of our best models consumes 6 MB of RAM, computes a single suggestion in 8 ms, and achieves 90% recall in its top five suggestions. Our models outperform standard language modeling code completion techniques in terms of predictive performance, computational speed, and memory efficiency. Furthermore, they learn about code semantics from the natural language aspects of the code (e.g. identifier names) and can generalize better to previously unseen code.

\n", "tags": ["autocomplete"], "tsne_embedding": [-6.100874900817871, 5.224014759063721]}, {"key": "svyatkovskiy2020intellicode", "year": "2020", "title": "IntelliCode Compose: Code Generation Using Transformer", "abstract": "

In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments.\nIn this paper, we introduce IntelliCode Compose \u2212 a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook.\nOur best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for Python programming language.

\n", "tags": ["autocomplete", "code generation", "synthesis", "language model", "pretraining"], "tsne_embedding": [-12.803207397460938, -14.306133270263672]}, {"key": "szafraniec2022code", "year": "2022", "title": "Code Translation with Compiler Representations", "abstract": "

In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low quality translation, reducing the practicality of NMT, and stressing the need for an approach significantly increasing its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java - Rust pair. We extend previous test sets for code translation, by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as intermediary pivot for translation.

\n", "tags": ["Transformer", "migration", "decompilation"], "tsne_embedding": [3.5251095294952393, 5.78542947769165]}, {"key": "tabassum2020code", "year": "2020", "title": "Code and Named Entity Recognition in StackOverflow", "abstract": "

There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which lead to an absolute increase of +10 F-1 score over off-the-shelf BERT. We also present the SoftNER model which achieves an overall 79.10 F1 score for code and named entity recognition on StackOverflow data. Our SoftNER model incorporates a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model.

\n", "tags": ["dataset", "information extraction"], "tsne_embedding": [-7.392780780792236, -6.3635101318359375]}, {"key": "tan2024llm4decompile", "year": "2024", "title": "LLM4Decompile: Decompiling Binary Code with Large Language Models", "abstract": "

Decompilation aims to restore compiled code to human-readable source code, but struggles with details like names and structure. Large language models (LLMs) show promise for programming tasks, motivating their application to decompilation. However, there does not exist any open-source LLM for decompilation. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs ranging from 1B to 33B pre-trained on 4 billion tokens of C source code and the corresponding assembly code. The open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating the decompilation model from the perspective of program semantics. Experiments indicate that our LLM4Decompile has demonstrated the capability to accurately decompile 21% of the assembly code, which achieves a 50% improvement over GPT-4. Our code, dataset, and models are released at this https URL

\n", "tags": ["decompilation", "translation", "evaluation", "large language models", "LLM"], "tsne_embedding": [13.293211936950684, 14.171186447143555]}, {"key": "tarlow2019learning", "year": "2019", "title": "Learning to Fix Build Errors with Graph2Diff Neural Networks", "abstract": "

Professional software developers spend a significant amount oftime fixing builds, but this has received little attention as a prob-lem in automatic program repair. We present a new deep learningarchitecture, called Graph2Diff, for automatically localizing andfixing build errors. We represent source code, build configurationfiles, and compiler diagnostic messages as a graph, and then use aGraph Neural Network model to predict a diff. A diff specifies howto modify the code\u2019s abstract syntax tree, represented in the neuralnetwork as a sequence of tokens and of pointers to code locations.Our network is an instance of a more general abstraction which wecall Graph2Tocopo, which is potentially useful in any developmenttool for predicting source code changes. We evaluate the model ona dataset of over 500k real build errors and their resolutions fromprofessional developers. Compared to the approach of DeepDelta, our approach tackles the harder task of predicting a moreprecise diff but still achieves over double the accuracy.

\n", "tags": ["edit", "repair"], "tsne_embedding": [15.741257667541504, -1.308530569076538]}, {"key": "theeten2019import2vec", "year": "2019", "title": "Import2vec - Learning Embeddings for Software Libraries", "abstract": "

We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning tasks (e.g. classification) or applications such as contextual search and analogical reasoning.

\n\n

We apply word embedding techniques from natural language processing (NLP) to train embeddings for library packages (\u201clibrary vectors\u201d). Library vectors represent libraries by similar context of use as determined by import statements present in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveals that library vectors capture semantically meaningful relationships among software libraries, such as the relationship between frameworks and their plug-ins and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript) and data science toolkits (in Python).

\n", "tags": ["representation"], "tsne_embedding": [5.832681655883789, -15.24854850769043]}, {"key": "tian2020evaluating", "year": "2020", "title": "Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair", "abstract": "

A large body of the literature of automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explore research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations to learn deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1000 labeled patches. Our study shows that learned representations can lead to reasonable performance when comparing against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.

\n", "tags": ["repair", "Transformer"], "tsne_embedding": [16.796730041503906, 1.1100424528121948]}, {"key": "tian2024debugbench", "year": "2024", "title": "DebugBench: Evaluating Debugging Capability of Large Language Models", "abstract": "

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs\u2019 debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce `DebugBench\u2019, an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and three open-source models in a zero-shot scenario. We find that (1) while closed-source models like GPT-4 exhibit inferior debugging performance compared to humans, open-source models such as Code Llama fail to attain any pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.

\n", "tags": ["repair"], "tsne_embedding": [17.426408767700195, 9.845132827758789]}, {"key": "tomczak2019simulating", "year": "2019", "title": "Simulating Execution Time of Tensor Programs using Graph Neural Networks", "abstract": "

Optimizing the execution time of tensor program, e.g., a convolution, involves finding its optimal configuration. Searching the configuration space exhaustively is typically infeasible in practice. In line with recent research using TVM, we propose to learn a surrogate model to overcome this issue. The model is trained on an acyclic graph called an abstract syntax tree, and utilizes a graph convolutional network to exploit structure in the graph. We claim that a learnable graph-based data processing is a strong competitor to heuristic-based feature extraction. We present a new dataset of graphs corresponding to configurations and their execution time for various tensor programs. We provide baselines for a runtime prediction task.

\n", "tags": ["GNN"], "tsne_embedding": [-0.6929211020469666, 15.298480033874512]}, {"key": "tran2019recovering", "year": "2019", "title": "Recovering Variable Names for Minified Code with Usage Contexts", "abstract": "

In modern Web technology, JavaScript (JS) code plays an important role. To avoid the exposure of original source code, the variable names in JS code deployed in the wild are often replaced by short, meaningless names, thus making the code extremely difficult to manually understand and analysis. This paper presents JSNeat, an information retrieval (IR)-based approach to recover the variable names in minified JS code. JSNeat follows a data-driven approach to recover names by searching for them in a large corpus of open-source JS code. We use three types of contexts to match a variable in given minified code against the corpus including the context of properties and roles of the variable, the context of that variable and relations with other variables under recovery, and the context of the task of the function to which the variable contributes. We performed several empirical experiments to evaluate JSNeat on the dataset of more than 322K JS files with 1M functions, and 3.5M variables with 176K unique variable names. We found that JSNeat achieves a high accuracy of 69.1%, which is the relative improvements of 66.1% and 43% over two state-of-the-art approaches JSNice and JSNaughty, respectively. The time to recover for a file or for a variable with JSNeat is twice as fast as with JSNice and 4x as fast as with JNaughty, respectively.

\n", "tags": ["naming", "deobfuscation"], "tsne_embedding": [18.00760269165039, 19.820934295654297]}, {"key": "tu2014localness", "year": "2014", "title": "On the Localness of Software", "abstract": "

The n-gram language model, which has its roots in statistical natural\nlanguage processing, has been shown to successfully capture the\nrepetitive and predictable regularities (\u201cnaturalness\u201d) of source code,\nand help with tasks such as code suggestion, porting, and designing\nassistive coding devices. However, we show in this paper that this\nnatural-language-based model fails to exploit a special property of\nsource code: localness. We find that human-written programs are\nlocalized: they have useful local regularities that can be captured\nand exploited. We introduce a novel cache language model that\nconsists of both an n-gram and an added \u201ccache\u201d component to\nexploit localness. We show empirically that the additional cache\ncomponent greatly improves the n-gram approach by capturing\nthe localness of software, as measured by both cross-entropy and\nsuggestion accuracy. Our model\u2019s suggestion accuracy is actually\ncomparable to a state-of-the-art, semantically augmented language\nmodel; but it is simpler and easier to implement. Our cache language\nmodel requires nothing beyond lexicalization, and thus is applicable\nto all programming languages.

\n", "tags": ["language model"], "tsne_embedding": [-10.87387752532959, -18.682952880859375]}, {"key": "tufano2018deep", "year": "2018", "title": "Deep Learning Similarities from Different Representations of Source Code", "abstract": "

Assessing the similarity between code components plays a pivotal\nrole in a number of Software Engineering (SE) tasks, such as clone\ndetection, impact analysis, refactoring, etc. \nCode similarity is generally measured by relying on manually defined or hand-crafted\nfeatures, e.g., by analyzing the overlap among identifiers or comparing the Abstract Syntax Trees of two code components. These\nfeatures represent a best guess at what SE researchers can utilize to\nexploit and reliably assess code similarity for a given task. Recent\nwork has shown, when using a stream of identifiers to represent\nthe code, that Deep Learning (DL) can effectively replace manual\nfeature engineering for the task of clone detection. However, source\ncode can be represented at different levels of abstraction: identifiers, Abstract Syntax Trees, Control Flow Graphs, and Bytecode.\nWe conjecture that each code representation can provide a different,\nyet orthogonal view of the same code fragment, thus, enabling a\nmore reliable detection of similarities in code. In this paper, we\ndemonstrate how SE tasks can benefit from a DL-based approach,\nwhich can automatically learn code similarities from different representations.

\n", "tags": ["representation", "clone"], "tsne_embedding": [3.082167863845825, -8.634384155273438]}, {"key": "tufano2018empirical", "year": "2018", "title": "An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation", "abstract": "

Millions of open-source projects with numerous bug fixes are available in code repositories. This proliferation of software development histories can be leveraged to learn how to fix common programming bugs. To explore such a potential, we perform an empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects. First, we mine millions of bug-fixes from the change histories of projects hosted on GitHub, in order to extract meaningful examples of such bug-fixes. Next, we abstract the buggy and corresponding fixed code, and use them to train an Encoder-Decoder model able to translate buggy code into its fixed version. In our empirical investigation we found that such a model is able to fix thousands of unique buggy methods in the wild. Overall, this model is capable of predicting fixed patches generated by developers in 9-50% of the cases, depending on the number of candidate patches we allow it to generate. Also, the model is able to emulate a variety of different Abstract Syntax Tree operations and generate candidate patches in a split second.

\n", "tags": ["repair"], "tsne_embedding": [19.068561553955078, 0.36664262413978577]}, {"key": "tufano2018learning", "year": "2018", "title": "Learning How to Mutate Source Code from Bug-Fixes", "abstract": "

Mutation testing has been widely accepted as an approach to guide test case generation or to assess the effectiveness of test suites. Empirical studies have shown that mutants are representative of real faults; yet they also indicated a clear need for better, possibly customized, mutation operators and strategies. While some recent papers have tried to devise domain-specific or general purpose mutator operators by manually analyzing real faults, such an activity is effort- (and error-) prone and does not deal with an important practical question as to how to really mutate a given source code element. We propose a novel approach to automatically learn mutants from faults in real programs. First, our approach processes bug fixing changes using fine-grained differencing, code abstraction, and change clustering. Then, it learns mutation models using a deep learning strategy. We have trained and evaluated our technique on a set of ~787k bugs mined from GitHub. Starting from code fixed by developers in the context of a bug-fix, our empirical evaluation showed that our models are able to predict mutants that resemble original fixed bugs in between 9% and 45% of the cases (depending on the model). Moreover, over 98% of the automatically generated mutants are lexically and syntactically correct.

\n", "tags": ["repair", "edit"], "tsne_embedding": [20.100252151489258, 3.7861876487731934]}, {"key": "tufano2019learning", "year": "2019", "title": "On Learning Meaningful Code Changes via Neural Machine Translation", "abstract": "

Recent years have seen the rise of Deep Learning (DL) techniques applied to source code. Researchers have exploited DL to automate several development and maintenance tasks, such as writing commit messages, generating comments and detecting vulnerabilities among others. One of the long lasting dreams of applying DL to code is the possibility to automate non-trivial coding activities. While some steps in this direction have been taken (e.g., learning how to fix bugs), there is still a lack of empirical evidence on the types of code changes that can be learned and automatically applied by DL. Our goal is to make this first step by quantitatively and qualitatively investigating the ability of a Neural Machine Translation (NMT) model to learn how to automatically apply code changes implemented by developers during pull requests. We train and experiment with the NMT model on a set of 236k pairs of code components before and after the implementation of the changes provided in the pull requests. We show that, when applied in a narrow enough context (i.e., small/medium-sized pairs of methods before/after the pull request changes), NMT can automatically replicate the changes implemented by developers during pull requests in up to 36% of the cases. Moreover, our qualitative analysis shows that the model is capable of learning and replicating a wide variety of meaningful code changes, especially refactorings and bug-fixing activities. Our results pave the way to novel research in the area of DL on code, such as the automatic learning and applications of refactoring.

\n", "tags": ["repair", "edit"], "tsne_embedding": [-13.572861671447754, 3.0538220405578613]}, {"key": "tufano2020generating", "year": "2020", "title": "Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers", "abstract": "

Unit testing represents the foundational basis of the software testing pyramid, beneath integration and end-to-end testing. Automated software testing researchers have proposed a variety of techniques to assist developers in this time-consuming task. In this paper we present an approach to support developers in writing unit test cases by generating accurate and useful assert statements. Our approach is based on a state-of-the-art transformer model initially pretrained on an English textual corpus. This semantically rich model is then trained in a semi-supervised fashion on a large corpus of source code. Finally, we finetune this model on the task of generating assert statements for unit tests. The resulting model is able to generate accurate assert statements for a given method under test. In our empirical evaluation, the model was able to predict the exact assert statements written by developers in 62% of the cases in the first attempt. The results show 80% relative improvement for top-1 accuracy over the previous RNN-based approach in the literature. We also show the substantial impact of the pretraining process on the performances of our model, as well as comparing it with assert auto-completion task. Finally, we demonstrate how our approach can be used to augment EvoSuite test cases, with additional asserts leading to improved test coverage.

\n", "tags": ["code generation", "synthesis", "test generation"], "tsne_embedding": [-16.79653549194336, 11.296119689941406]}, {"key": "tufano2020unit", "year": "2020", "title": "Unit Test Case Generation with Transformers", "abstract": "

Automated Unit Test Case generation has been the focus of extensive literature within the research community. Existing approaches are usually guided by the test coverage criteria, generating synthetic test cases that are often difficult to read or understand for developers. In this paper we propose AthenaTest, an approach that aims at generating unit test cases by learning from real-world, developer-written test cases. Our approach relies on a state-of-the-art sequence-to-sequence transformer model which is able to write useful test cases for a given method under test (i.e., focal method). We also introduce methods2test - the largest publicly available supervised parallel corpus of unit test case methods and corresponding focal methods in Java, which comprises 630k test cases mined from 70k open-source repositories hosted on GitHub. We use this dataset to train a transformer model to translate focal methods into the corresponding test cases. We evaluate the ability of our model in generating test cases using natural language processing as well as code-specific criteria. First, we assess the quality of the translation compared to the target test case, then we analyze properties of the test case such as syntactic correctness and number and variety of testing APIs (e.g., asserts). We execute the test cases, collect test coverage information, and compare them with test cases generated by EvoSuite and GPT-3. Finally, we survey professional developers on their preference in terms of readability, understandability, and testing effectiveness of the generated test cases.

\n", "tags": ["code generation", "synthesis", "test generation"], "tsne_embedding": [-16.190946578979492, 11.351231575012207]}, {"key": "vaithilingam2022expectation", "year": "2022", "title": "Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models", "abstract": "

Recent advances in Large Language Models (LLM) have made automatic code generation possible for real-world programming tasks in\ngeneral-purpose programming languages such as Python. However,\nthere are few human studies on the usability of these tools and how\nthey fit the programming workflow. In this work, we conducted\na within-subjects user study with 24 participants to understand\nhow programmers use and perceive Copilot, a LLM-based code\ngeneration tool. We found that, while Copilot did not necessarily\nimprove the task completion time or success rate, most participants preferred to use Copilot in daily programming tasks, since\nCopilot often provided a useful starting point and saved the effort\nof searching online. However, participants did face difficulties in\nunderstanding, editing, and debugging code snippets generated\nby Copilot, which significantly hindered their task-solving effectiveness. Finally, we highlighted several promising directions for\nimproving the design of Copilot based on our observations and\nparticipants\u2019 feedback.

\n", "tags": ["human evaluation", "code generation", "language model"], "tsne_embedding": [7.31646203994751, -2.386716604232788]}, {"key": "vasic2019neural", "year": "2019", "title": "Neural Program Repair by Jointly Learning to Localize and Repair", "abstract": "

Due to its potential to improve programmer productivity and software quality, automated program repair has been an active topic of research. Newer techniques harness neural networks to learn directly from examples of buggy programs and their fixes. In this work, we consider a recently identified class of bugs called variable-misuse bugs. The state-of-the-art solution for variable misuse enumerates potential fixes for all possible bug locations in a program, before selecting the best prediction. We show that it is beneficial to train a model that jointly and directly localizes and repairs variable-misuse bugs. We present multi-headed pointer networks for this purpose, with one head each for localization and repair. The experimental results show that the joint model significantly outperforms an enumerative solution that uses a pointer based model for repair alone.

\n", "tags": ["repair", "program analysis", "variable misuse"], "tsne_embedding": [22.30829620361328, 0.5349947214126587]}, {"key": "vasilescu2017recovering", "year": "2017", "title": "Recovering Clear, Natural Identifiers from Obfuscated JS Names", "abstract": "

Well-chosen variable names are critical to source code readability, reusability, and maintainability. Unfortunately, in deployed JavaScript code (which is ubiquitous on the web) the identifier names are frequently minified and overloaded. This is done both for efficiency and also to protect potentially proprietary intellectual property. In this paper, we describe an approach based on statistical machine translation (SMT) that recovers some of the original names from the JavaScript programs minified by the very popular UglifyJS. This simple tool, Autonym, performs comparably to the best currently available deobfuscator for JavaScript, JSNice, which uses sophisticated static analysis. In fact, Autonym is quite complementary to JSNice, performing well when it does not, and vice versa. We also introduce a new tool, JSNaughty, which blends Autonym and JSNice, and significantly outperforms both at identifier name recovery, while remaining just as easy to use as JSNice. JSNaughty is available online at http://jsnaughty.org.

\n", "tags": ["deobfuscation", "naming"], "tsne_embedding": [18.18354606628418, 19.276578903198242]}, {"key": "villmow2021contest", "year": "2021", "title": "ConTest: A Unit Test Completion Benchmark featuring Context", "abstract": "

We introduce CONTEST, a benchmark for NLP-based unit test completion, the task of predicting a test\u2019s assert statements given its setup and focal method, i.e. the method to be tested. ConTest is large-scale (with 365k datapoints). Besides the test code and tested code, it also features context code called by either. We found context to be crucial for accurately predicting assertions. We also introduce baselines based on transformer encoder-decoders, and study the effects of including syntactic information and context. Overall, our models achieve a BLEU score of 38.2, while only generating unparsable code in 1.92% of cases.

\n", "tags": ["benchmark", "dataset", "verification", "Transformer"], "tsne_embedding": [-17.49346160888672, 10.9011812210083]}, {"key": "wan2018improving", "year": "2018", "title": "Improving Automatic Source Code Summarization via Deep Reinforcement Learning", "abstract": "

Code summarization provides a high level natural language description of the function performed by code, as it can benefit the software maintenance, code categorization and retrieval. To the best of our knowledge, most state-of-the-art approaches follow an encoder-decoder framework which encodes the code into a hidden space and then decode it into natural language space, suffering from two major drawbacks: a) Their encoders only consider the sequential content of code, ignoring the tree structure which is also critical for the task of code summarization; b) Their decoders are typically trained to predict the next word by maximizing the likelihood of next ground-truth word with previous ground-truth word given. However, it is expected to generate the entire sequence from scratch at test time. This discrepancy can cause an exposure bias issue, making the learnt decoder suboptimal. In this paper, we incorporate an abstract syntax tree structure as well as sequential content of code snippets into a deep reinforcement learning framework (i.e., actor-critic network). The actor network provides the confidence of predicting the next word according to current state. On the other hand, the critic network evaluates the reward value of all possible extensions of the current state and can provide global guidance for explorations. We employ an advantage reward composed of BLEU metric to train both networks. Comprehensive experiments on a real-world dataset show the effectiveness of our proposed model when compared with some state-of-the-art methods.

\n", "tags": ["summarization", "documentation"], "tsne_embedding": [-16.909690856933594, -5.127161502838135]}, {"key": "wan2019multimodal", "year": "2019", "title": "Multi-Modal Attention Network Learning for Semantic Source Code Retrieval", "abstract": "

Code retrieval techniques and tools have been playing a key role in facilitating software developers to retrieve existing code fragments from available open-source repositories given a user query. Despite the existing efforts in improving the effectiveness of code retrieval, there are still two main issues hindering them from being used to accurately retrieve satisfiable code fragments from large-scale repositories when answering complicated queries. First, the existing approaches only consider shallow features of source code such as method names and code tokens, but ignoring structured features such as abstract syntax trees (ASTs) and control-flow graphs (CFGs) of source code, which contains rich and well-defined semantics of source code. Second, although the deep learning-based approach performs well on the representation of source code, it lacks the explainability, making it hard to interpret the retrieval results and almost impossible to understand which features of source code contribute more to the final results.

\n\n

To tackle the two aforementioned issues, this paper proposes MMAN, a novel Multi-Modal Attention Network for semantic source code retrieval. A comprehensive multi-modal representation is developed for representing unstructured and structured features of source code, with one LSTM for the sequential tokens of code, a Tree-LSTM for the AST of code and a GGNN (Gated Graph Neural Network) for the CFG of code. Furthermore, a multi-modal attention fusion layer is applied to assign weights to different parts of each modality of source code and then integrate them into a single hybrid representation. Comprehensive experiments and analysis on a large-scale real-world dataset show that our proposed model can accurately retrieve code snippets and outperforms the state-of-the-art methods.

\n", "tags": ["search"], "tsne_embedding": [-1.0626057386398315, -11.039911270141602]}, {"key": "wan2020naturalcc", "year": "2020", "title": "NaturalCC: A Toolkit to Naturalize the Source Code Corpus", "abstract": "

We present NaturalCC, an efficient and extensible toolkit to bridge the gap between natural language and programming language, and facilitate the research on big code analysis. Using NaturalCC, researchers both from natural language or programming language communities can quickly and easily reproduce the state-of-the-art baselines and implement their approach. NaturalCC is built upon Fairseq and PyTorch, providing (1) an efficient computation with multi-GPU and mixed-precision data processing for fast model training, (2) a modular and extensible framework that makes it easy to reproduce or implement an approach for big code analysis, and (3) a command line interface and a graphical user interface to demonstrate each model\u2019s performance. Currently, we have included several state-of-the-art baselines across different tasks (e.g., code completion, code comment generation, and code retrieval) for demonstration. The video of this demo is available at https://www.youtube.com/watch?v=q4W5VSI-u3E&t=25s.

\n", "tags": ["documentation", "search", "summarization"], "tsne_embedding": [2.3768067359924316, 0.17620176076889038]}, {"key": "wan2022what", "year": "2022", "title": "What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code", "abstract": "

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and Transformer and have achieved promising results. However, currently there is still little progress regarding interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT, and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-training language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code have the ability of inducing syntax trees of code. Theses findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations.

\n", "tags": ["Transformer", "pretraining", "program analysis"], "tsne_embedding": [-4.03557825088501, -3.1149017810821533]}, {"key": "wang2016automatically", "year": "2016", "title": "Automatically Learning Semantic Features for Defect Prediction", "abstract": "

Software defect prediction, which predicts defective code regions, can help developers find bugs and prioritize their testing efforts. To build accurate prediction models, previous\nstudies focus on manually designing features that encode the\ncharacteristics of programs and exploring different machine\nlearning algorithms. Existing traditional features often fail\nto capture the semantic differences of programs, and such a\ncapability is needed for building accurate prediction models.

\n\n

To bridge the gap between programs\u2019 semantics and\ndefect prediction features, this paper proposes to leverage a\npowerful representation-learning algorithm, deep learning,\nto learn semantic representation of programs automatically\nfrom source code. Specifically, we leverage Deep Belief\nNetwork (DBN) to automatically learn semantic features\nfrom token vectors extracted from programs\u2019 Abstract\nSyntax Trees (ASTs).

\n\n

Our evaluation on ten open source projects shows that\nour automatically learned semantic features significantly improve both within-project defect prediction (WPDP) and\ncross-project defect prediction (CPDP) compared to traditional features. Our semantic features improve WPDP on\naverage by 14.7% in precision, 11.5% in recall, and 14.2%\nin F1. For CPDP, our semantic features based approach\noutperforms the state-of-the-art technique TCA+ with traditional features by 8.9% in F1.

\n", "tags": ["defect", "representation"], "tsne_embedding": [13.865985870361328, 2.994459390640259]}, {"key": "wang2016bugram", "year": "2016", "title": "Bugram: bug detection with n-gram language models", "abstract": "

To improve software reliability, many rule-based techniques have been proposed to infer programming rules and detect violations of these rules as bugs. These rule-based approaches often rely on the highly frequent appearances of certain patterns in a project to infer rules. It is known that if a pattern does not appear frequently enough, rules are not learned, thus missing many bugs.

\n\n

In this paper, we propose a new approach\u2014Bugram\u2014that leverages n-gram language models instead of rules to detect bugs. Bugram models program tokens sequentially, using the n-gram language model. Token sequences from the program are then assessed according to their probability in the learned model, and low probability sequences are marked as potential bugs. The assumption is that low probability token sequences in a program are unusual, which may indicate bugs, bad practices, or unusual/special uses of code of which developers may want to be aware.

\n\n

We evaluate Bugram in two ways. First, we apply Bugram on the latest versions of 16 open source Java projects. Results show that Bugram detects 59 bugs, 42 of which are manually verified as correct, 25 of which are true bugs and 17 are code snippets that should be refactored. Among the 25 true bugs, 23 cannot be detected by PR-Miner. We have reported these bugs to developers, 7 of which have already been confirmed by developers (4 of them have already been fixed), while the rest await confirmation. Second, we further compare Bugram with three additional graph- and rule-based bug detection tools, i.e., JADET, Tikanga, and GrouMiner. We apply Bugram on 14 Java projects evaluated in these three studies. Bugram detects 21 true bugs, at least 10 of which cannot be detected by these three tools. Our results suggest that Bugram is complementary to existing rule-based bug detection approaches.

\n\n", "tags": ["defect", "representation"], "tsne_embedding": [20.289419174194336, 7.893535137176514]}, {"key": "wang2016neural", "year": "2016", "title": "Neural Code Completion", "abstract": "

Code completion, an essential part of modern software development, yet can be\nchallenging for dynamically typed programming languages. In this paper we explore the use of neural network techniques to automatically learn code completion\nfrom a large corpus of dynamically typed JavaScript code. We show different\nneural networks that leverage not only token level information but also structural\ninformation, and evaluate their performance on different prediction tasks. We\ndemonstrate that our models can outperform the state-of-the-art approach, which\nis based on decision tree techniques, on both next non-terminal and next terminal\nprediction tasks by 3.8 points and 0.5 points respectively. We believe that neural\nnetwork techniques can play a transformative role in helping software developers\nmanage the growing complexity of software systems, and we see this work as a\nfirst step in that direction.

\n", "tags": ["autocomplete"], "tsne_embedding": [-8.614161491394043, 5.863827228546143]}, {"key": "wang2019learning", "year": "2019", "title": "Learning Scalable and Precise Representation of Program Semantics", "abstract": "

Neural program embedding has shown potential in aiding the analysis of large-scale, complicated software. Newly proposed deep neural architectures pride themselves on learning program semantics rather than superficial syntactic features. However, by considering the source code only, the vast majority of neural networks do not capture a deep, precise representation of program semantics. In this paper, we present \\dypro, a novel deep neural network that learns from program execution traces. Compared to the prior dynamic models, not only is \\dypro capable of generalizing across multiple executions for learning a program\u2019s dynamic semantics in its entirety, but \\dypro is also more efficient when dealing with programs yielding long execution traces. For evaluation, we task \\dypro with semantic classification (i.e. categorizing programs based on their semantics) and compared it against two prominent static models: Gated Graph Neural Network and TreeLSTM. We find that \\dypro achieves the highest prediction accuracy among all models. To further reveal the capacity of all aforementioned deep neural architectures, we examine if the models can learn to detect deeper semantic properties of a program. In particular given a task of recognizing loop invariants, we show \\dypro beats all static models by a wide margin.

\n", "tags": ["representation", "dynamic"], "tsne_embedding": [3.1096699237823486, 13.569807052612305]}, {"key": "wang2020blended", "year": "2020", "title": "Blended, precise semantic program embeddings", "abstract": "

Learning neural program embeddings is key to utilizing deep neural networks in program languages research \u2014 precise and efficient program representations enable the application of deep models to a wide range of program analysis tasks. Existing approaches predominately learn to embed programs from their source code, and, as a result, they do not capture deep, precise program semantics. On the other hand, models learned from runtime information critically depend on the quality of program executions, thus leading to trained models with highly variant quality. This paper tackles these inherent weaknesses of prior approaches by introducing a new deep neural network, Liger, which learns program representations from a mixture of symbolic and concrete execution traces. We have evaluated Liger on two tasks: method name prediction and semantics classification. Results show that Liger is significantly more accurate than the state-of-the-art static model code2seq in predicting method names, and requires on average around 10x fewer executions covering nearly 4x fewer paths than the state-of-the-art dynamic model DYPRO in both tasks. Liger offers a new, interesting design point in the space of neural program embeddings and opens up this new direction for exploration.

\n", "tags": ["dynamic"], "tsne_embedding": [3.193364143371582, 13.446542739868164]}, {"key": "wang2020cocogum", "year": "2020", "title": "CoCoGUM: Contextual Code Summarization with Multi-Relational GNN on UMLs", "abstract": "

Code summaries are short natural language (NL) descriptions of code snippets that help developers better understand and maintain source code. Due to the pivotal role of code summaries in software development and maintenance, there is a surge of works on automatic code summarization to reduce the heavy burdens of developers. However, contemporary approaches only leverage the information within the boundary of the method being summarized (i.e., local context), and ignore that using broader context could assist with code summarization. In this paper, we explore two global context information, namely intra-class and inter-class context information, and propose the model CoCoGUM: Contextual Code Summarization with Multi-Relational Graph Neural Networks on UMLs. CoCoGUM first incorporates class names as the intra-class context, which is further fed to a Transformer-based sentence embedding model to extract the class lexical embeddings. Then, relevant Unified Modeling Language (UML) class diagrams are extracted as inter-class context and we use a Multi-Relational Graph Neural Network (MR-GNN) to encode the class relational embeddings. Class lexical embeddings and class relational embeddings, together with the outputs from code token encoder and AST encoder, are passed to the decoder armed with a two-level attention mechanism to generate high-quality context-aware code summaries. We conduct extensive experiments to evaluate our approach and compare it with other automatic code summarization models. The experimental results show that CoCoGUM outperforms state-of-the-art methods.

\n", "tags": ["summarization"], "tsne_embedding": [-17.409530639648438, -7.076180458068848]}, {"key": "wang2020detecting", "year": "2020", "title": "Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree", "abstract": "

Code clones are semantically similar code fragments pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches of detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still can not fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. As far as we have concerned, we are the first to apply graph neural networks on the domain of code clone detection. We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.

\n", "tags": ["clone", "GNN"], "tsne_embedding": [2.523263692855835, -7.683289527893066]}, {"key": "wang2020learning", "year": "2020", "title": "Learning Semantic Program Embeddings with Graph Interval Neural Network", "abstract": "

Learning distributed representations of source code has been a challenging task for machine learning models. Earlier works treated programs as text so that natural language methods can be readily applied. Unfortunately, such approaches do not capitalize on the rich structural information possessed by source code. Of late, Graph Neural Network (GNN) was proposed to learn embeddings of programs from their graph representations. Due to the homogeneous and expensive message-passing procedure, GNN can suffer from precision issues, especially when dealing with programs rendered into large graphs. In this paper, we present a new graph neural architecture, called Graph Interval Neural Network (GINN), to tackle the weaknesses of the existing GNN. Unlike the standard GNN, GINN generalizes from a curated graph representation obtained through an abstraction method designed to aid models to learn. In particular, GINN focuses exclusively on intervals for mining the feature representation of a program, furthermore, GINN operates on a hierarchy of intervals for scaling the learning to large graphs. We evaluate GINN for two popular downstream applications: variable misuse prediction and method name prediction. Results show in both cases GINN outperforms the state-of-the-art models by a comfortable margin. We have also created a neural bug detector based on GINN to catch null pointer deference bugs in Java code. While learning from the same 9,000 methods extracted from 64 projects, GINN-based bug detector significantly outperforms GNN-based bug detector on 13 unseen test projects. Next, we deploy our trained GINN-based bug detector and Facebook Infer to scan the codebase of 20 highly starred projects on GitHub. Through our manual inspection, we confirm 38 bugs out of 102 warnings raised by GINN-based bug detector compared to 34 bugs out of 129 warnings for Facebook Infer.

\n", "tags": ["GNN", "defect"], "tsne_embedding": [-2.8179855346679688, 11.643033027648926]}, {"key": "wang2020learning2", "year": "2020", "title": "Learning to Represent Programs with Heterogeneous Graphs", "abstract": "

Program source code contains complex structure information, which can be represented in structured data forms like trees or graphs. To acquire the structural information in source code, most existing researches use abstract syntax trees (AST). A group of works add additional edges to ASTs to convert source code into graphs and use graph neural networks to learn representations for program graphs. Although these works provide additional control or data flow information to ASTs for downstream tasks, they neglect an important aspect of structure information in AST itself: the different types of nodes and edges. In ASTs, different nodes contain different kinds of information like variables or control flow, and the relation between a node and all its children can also be different.

\n\n

To address the information of node and edge types, we bring the idea of heterogeneous graphs to learning on source code and present a new formula of building heterogeneous program graphs from ASTs with additional type information for nodes and edges. We use the ASDL grammar of programming language to define the node and edge types of program graphs. Then we use heterogeneous graph neural networks to learn on these graphs. We evaluate our approach on two tasks: code comment generation and method naming. Both tasks require reasoning on the semantics of complete code snippets. Experiment results show that our approach outperforms baseline models, including homogeneous graph-based models, showing that leveraging the type information of nodes and edges in program graphs can help in learning program semantics.

\n", "tags": ["GNN", "summarization"], "tsne_embedding": [-1.9396592378616333, 12.806200981140137]}, {"key": "wang2020modular", "year": "2020", "title": "Modular Tree Network for Source Code Representation Learning", "abstract": "

Learning representation for source code is a foundation of many program analysis tasks. In recent years, neural networks have already shown success in this area, but most existing models did not make full use of the unique structural information of programs. Although abstract syntax tree (AST)-based neural models can handle the tree structure in the source code, they cannot capture the richness of different types of substructure in programs. In this article, we propose a modular tree network that dynamically composes different neural network units into tree structures based on the input AST. Different from previous tree-structural neural network models, a modular tree network can capture the semantic differences between types of AST substructures. We evaluate our model on two tasks: program classification and code clone detection. Our model achieves the best performance compared with state-of-the-art approaches in both tasks, showing the advantage of leveraging more elaborate structure information of the source code.

\n", "tags": ["grammar", "representation"], "tsne_embedding": [0.305649995803833, -8.638275146484375]}, {"key": "wang2020trans", "year": "2020", "title": "TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search", "abstract": "

Code summarization and code search have been widely adopted in sofwaredevelopmentandmaintenance. However, fewstudieshave explored the efcacy of unifying them. In this paper, we propose TranS^3 , a transformer-based framework to integrate code summarization with code search. Specifcally, for code summarization,TranS^3 enables an actor-critic network, where in the actor network, we encode the collected code snippets via transformer- and tree-transformer-based encoder and decode the given code snippet to generate its comment. Meanwhile, we iteratively tune the actor network via the feedback from the critic network for enhancing the quality of the generated comments. Furthermore, we import the generated comments to code search for enhancing its accuracy. To evaluatetheefectivenessof TranS^3 , we conduct a set of experimental studies and case studies where the experimental results suggest that TranS^3 can signifcantly outperform multiple state-of-the-art approaches in both code summarization and code search and the study results further strengthen the efcacy of TranS^3 from the developers\u2019 points of view.

\n", "tags": ["search", "documentation"], "tsne_embedding": [-14.620357513427734, -10.531896591186523]}, {"key": "wang2021codet5", "year": "2021", "title": "CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation", "abstract": "

Pre-trained models for Natural Languages (NL) like BERT and GPT have been recently shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers. Our model employs a unified framework to seamlessly support both code understanding and generation tasks and allows for multi-task learning. Besides, we propose a novel identifier-aware pre-training task that enables the model to distinguish which code tokens are identifiers and to recover them when they are masked. Furthermore, we propose to exploit the user-written code comments with a bimodal dual generation task for better NL-PL alignment. Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL. Further analysis reveals that our model can better capture semantic information from code. Our code and pre-trained models are released at https://github.com/salesforce/CodeT5 .

\n", "tags": ["Transformer"], "tsne_embedding": [-5.197187423706055, -4.503530025482178]}, {"key": "wang2021syncobert", "year": "2021", "title": "SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation", "abstract": "

Code representation learning, which aims to encode the semantics of source code into distributed vectors, plays an important role in recent deep-learning-based models for code intelligence. Recently, many pre-trained language models for source code (e.g., CuBERT and CodeBERT) have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code search, code clone detection, and program translation. Current approaches typically consider the source code as a plain sequence of tokens, or inject the structure information (e.g., AST and data-flow) into the sequential model pre-training. To further explore the properties of programming languages, this paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations. Specially, we design two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP), which are designed to predict identifiers, and edges between two nodes of AST, respectively. Meanwhile, to exploit the complementary information in semantically equivalent modalities (i.e., code, comment, AST) of the code, we propose a multi-modal contrastive learning strategy to maximize the mutual information among different modalities. Extensive experiments on four downstream tasks related to code intelligence show that SynCoBERT advances the state-of-the-art with the same pre-training corpus and model size.

\n", "tags": ["pretraining"], "tsne_embedding": [-4.022733211517334, -1.0537976026535034]}, {"key": "wang2023codet5", "year": "2023", "title": "CodeT5+: Open Code Large Language Models for Code Understanding and Generation", "abstract": "

Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degrade. To address these limitations, we propose ``CodeT5+\u2019\u2019, a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs.

\n", "tags": ["Transformer"], "tsne_embedding": [-2.0687124729156494, -0.9910061955451965]}, {"key": "wang2023deepvd", "year": "2023", "title": "DeepVD: Toward Class-Separation Features for Neural Network Vulnerability Detection", "abstract": "

The advances of machine learning (ML) including deep learning (DL) have enabled several approaches to implicitly learn vulnerable code patterns to automatically detect software vulnerabilities. A recent study showed that despite successes, the existing ML/DL-based vulnerability detection (VD) models are limited in the ability to distinguish between the two classes of vulnerability and benign code. We propose DeepVD, a graph-based neural network VD model that emphasizes on class-separation features between vulnerability and benign code. DeepVD leverages three types of class-separation features at different levels of abstraction: statement types (similar to Part-of-Speech tagging), Post-Dominator Tree (covering regular flows of execution), and Exception Flow Graph (covering the exception and error-handling flows). We conducted several experiments to evaluate DeepVD in a real-world vulnerability dataset of 303 projects with 13,130 vulnerable methods. Our results show that DeepVD relatively improves over the state-of-the-art ML/DL-based VD approaches 13%\u201329.6% in precision, 15.6%\u201328.9% in recall, and 16.4%\u201325.8% in F-score. Our ablation study confirms that our designed features and components help DeepVD achieve high class-separability for vulnerability and benign code.

\n", "tags": ["vulnerability"], "tsne_embedding": [7.848397731781006, 18.973228454589844]}, {"key": "watson2021systematic", "year": "2021", "title": "A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research", "abstract": "

An increasingly popular set of techniques adopted by software engineering (SE) researchers to automate development tasks are those rooted in the concept of Deep Learning (DL). The popularity of such techniques largely stems from their automated feature engineering capabilities, which aid in modeling software artifacts. However, due to the rapid pace at which DL techniques have been adopted, it is difficult to distill the current successes, failures, and opportunities of the current research landscape. In an effort to bring clarity to this crosscutting area of work, from its modern inception to the present, this paper presents a systematic literature review of research at the intersection of SE & DL. The review canvases work appearing in the most prominent SE and DL conferences and journals and spans 128 papers across 23 unique SE tasks. We center our analysis around the components of learning, a set of principles that govern the application of machine learning techniques (ML) to a given problem domain, discussing several aspects of the surveyed work at a granular level. The end result of our analysis is a research roadmap that both delineates the foundations of DL techniques applied to SE research, and highlights likely areas of fertile exploration for the future.

\n", "tags": ["survey"], "tsne_embedding": [3.9605979919433594, 22.373615264892578]}, {"key": "waunakh2019idbench", "year": "2021", "title": "IdBench: Evaluating Semantic Representations of Identifier Names in Source Code", "abstract": "

Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of namebased analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other. As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.

\n", "tags": ["representation"], "tsne_embedding": [6.656747341156006, -11.981378555297852]}, {"key": "wei2019code", "year": "2019", "title": "Code Generation as a Dual Task of Code Summarization", "abstract": "

Code summarization (CS) and code generation (CG) are two crucial tasks in the field of automatic software development. Various neural network-based approaches are proposed to solve these two tasks separately. However, there exists a specific intuitive correlation between CS and CG, which have not been exploited in previous work. In this paper, we apply the relations between two tasks to improve the performance of both tasks. In other words, exploiting the duality between the two tasks, we propose a dual training framework to train the two tasks simultaneously. In this framework, we consider the dualities on probability and attention weights, and design corresponding regularization terms to constrain the duality. We evaluate our approach on two datasets collected from GitHub, and experimental results show that our dual framework can improve the performance of CS and CG tasks over baselines.

\n", "tags": ["code generation", "summarization"], "tsne_embedding": [-13.622103691101074, -8.359620094299316]}, {"key": "wei2020lambdanet", "year": "2020", "title": "LambdaNet: Probabilistic Type Inference using Graph Neural Networks", "abstract": "

As gradual typing becomes increasingly popular in languages like Python and TypeScript, there is a growing need to infer type annotations automatically. While type annotations help with tasks like code completion and static error catching, these annotations cannot be fully inferred by compilers and are tedious to annotate by hand. This paper proposes a probabilistic type inference scheme for TypeScript based on a graph neural network. Our approach first uses lightweight source code analysis to generate a program abstraction called a type dependency graph, which links type variables with logical constraints as well as name and usage information. Given this program abstraction, we then use a graph neural network to propagate information between related type variables and eventually make type predictions. Our neural architecture can predict both standard types, like number or string, as well as user-defined types that have not been encountered during training. Our experimental results show that our approach outperforms prior work in this space by 14% (absolute) on library types, while having the ability to make type predictions that are out of scope for existing techniques.

\n", "tags": ["GNN", "types"], "tsne_embedding": [-4.131369590759277, 28.268749237060547]}, {"key": "wei2023typet5", "year": "2023", "title": "TypeT5: Seq2seq Type Inference using Static Analysis", "abstract": "

There has been growing interest in automatically predicting missing type annotations in programs written in Python and JavaScript. While prior methods have achieved impressive accuracy when predicting the most common types, they often perform poorly on rare or complex types. In this paper, we present a new type inference method that treats type prediction as a code infilling task by leveraging CodeT5, a state-of-the-art seq2seq pre-trained language model for code. Our method uses static analysis to construct dynamic contexts for each code element whose type signature is to be predicted by the model. We also propose an iterative decoding scheme that incorporates previous type predictions in the model\u2019s input context, allowing information exchange between related code elements. Our evaluation shows that the proposed approach, TypeT5, not only achieves a higher overall accuracy (particularly on rare and complex types) but also produces more coherent results with fewer type errors \u2013 while enabling easy user intervention.

\n", "tags": ["types", "Transformer"], "tsne_embedding": [-2.0304720401763916, 28.094324111938477]}, {"key": "white2015toward", "year": "2015", "title": "Toward Deep Learning Software Repositories", "abstract": "

Deep learning subsumes algorithms that automatically learn compositional representations. The ability of these\nmodels to generalize well has ushered in tremendous advances\nin many fields such as natural language processing (NLP).\nRecent research in the software engineering (SE) community\nhas demonstrated the usefulness of applying NLP techniques to\nsoftware corpora. Hence, we motivate deep learning for software\nlanguage modeling, highlighting fundamental differences between\nstate-of-the-practice software language models and connectionist\nmodels. Our deep learning models are applicable to source\ncode files (since they only require lexically analyzed source\ncode written in any programming language) and other types\nof artifacts. We show how a particular deep learning model\ncan remember its state to effectively model sequential data,\ne.g., streaming software tokens, and the state is shown to be\nmuch more expressive than discrete tokens in a prefix. Then we\ninstantiate deep learning models and show that deep learning\ninduces high-quality models compared to n-grams and cache-based n-grams on a corpus of Java projects. We experiment\nwith two of the models\u2019 hyperparameters, which govern their\ncapacity and the amount of context they use to inform predictions,\nbefore building several committees of software language models\nto aid generalization. Then we apply the deep learning models to\ncode suggestion and demonstrate their effectiveness at a real SE\ntask compared to state-of-the-practice models. Finally, we propose\navenues for future work, where deep learning can be brought to\nbear to support model-based testing, improve software lexicons,\nand conceptualize software artifacts. Thus, our work serves as\nthe first step toward deep learning software repositories.

\n", "tags": ["representation"], "tsne_embedding": [-2.993354320526123, 5.2688164710998535]}, {"key": "white2016deep", "year": "2016", "title": "Deep Learning Code Fragments for Code Clone Detection", "abstract": "

Code clone detection is an important problem for software\nmaintenance and evolution. Many approaches consider either structure or identifiers, but none of the existing detection techniques model both sources of information. These\ntechniques also depend on generic, handcrafted features to\nrepresent code fragments. We introduce learning-based detection techniques where everything for representing terms\nand fragments in source code is mined from the repository.\nOur code analysis supports a framework, which relies on\ndeep learning, for automatically linking patterns mined at\nthe lexical level with patterns mined at the syntactic level.\nWe evaluated our novel learning-based approach for code\nclone detection with respect to feasibility from the point\nof view of software maintainers. We sampled and manually\nevaluated 398 file- and 480 method-level pairs across eight\nreal-world Java systems; 93% of the file- and method-level\nsamples were evaluated to be true positives. Among the true\npositives, we found pairs mapping to all four clone types. We\ncompared our approach to a traditional structure-oriented\ntechnique and found that our learning-based approach detected clones that were either undetected or suboptimally\nreported by the prominent tool Deckard. Our results affirm\nthat our learning-based approach is suitable for clone detection and a tenable technique for researchers.

\n", "tags": ["clone"], "tsne_embedding": [3.76528000831604, -8.29830265045166]}, {"key": "white2017sorting", "year": "2017", "title": "Sorting and Transforming Program Repair Ingredients via Deep Learning Code Similarities", "abstract": "

In the field of automated program repair, the redundancy assumption claims large programs contain the seeds\nof their own repair. However, most redundancy-based program\nrepair techniques do not reason about the repair ingredients\u2014the code that is reused to craft a patch. We aim to reason about\nthe repair ingredients by using code similarities to prioritize and\ntransform statements in a codebase for patch generation. Our\napproach, DeepRepair, relies on deep learning to reason about\ncode similarities. Code fragments at well-defined levels of granularity in a codebase can be sorted according to their similarity\nto suspicious elements (i.e., code elements that contain suspicious\nstatements) and statements can be transformed by mapping out-of-scope identifiers to similar identifiers in scope. We examined\nthese new search strategies for patch generation with respect to\neffectiveness from the viewpoint of a software maintainer. Our\ncomparative experiments were executed on six open-source Java\nprojects including 374 buggy program revisions and consisted\nof 19,949 trials spanning 2,616 days of computation time. DeepRepair\u2019s search strategy using code similarities generally found\ncompilable ingredients faster than the baseline, jGenProg, but\nthis improvement neither yielded test-adequate patches in fewer\nattempts (on average) nor found significantly more patches than\nthe baseline. Although the patch counts were not statistically\ndifferent, there were notable differences between the nature of\nDeepRepair patches and baseline patches. The results demonstrate that our learning-based approach finds patches that cannot\nbe found by existing redundancy-based repair techniques

\n", "tags": ["repair"], "tsne_embedding": [17.880760192871094, 1.1914873123168945]}, {"key": "wong2021leveraging", "year": "2021", "title": "Leveraging Language to Learn Program Abstractions and Search Heuristics", "abstract": "

Inductive program synthesis, or inferring programs from examples of desired behavior, offers a general paradigm for building interpretable, robust, and generalizable machine learning systems. Effective program synthesis depends on two key ingredients: a strong library of functions from which to build programs, and an efficient search strategy for finding programs that solve a given task. We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis. When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization on three domains \u2013 string editing, image composition, and abstract reasoning about scenes \u2013 even when no natural language hints are available at test time.

\n", "tags": ["synthesis", "search"], "tsne_embedding": [6.011080265045166, 7.1109700202941895]}, {"key": "wu2021prototransformer", "year": "2021", "title": "ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback", "abstract": "

High-quality computer science education is limited by the difficulty of providing instructor feedback to students at scale. While this feedback could in principle be automated, supervised approaches to predicting the correct feedback are bottlenecked by the intractability of annotating large quantities of student code. In this paper, we instead frame the problem of providing feedback as few-shot classification, where a meta-learner adapts to give feedback to student code on a new programming question from just a few examples annotated by instructors. Because data for meta-training is limited, we propose a number of amendments to the typical few-shot learning framework, including task augmentation to create synthetic tasks, and additional side information to build stronger priors about each task. These additions are combined with a transformer architecture to embed discrete sequences (e.g. code) to a prototypical representation of a feedback class label. On a suite of few-shot natural language processing tasks, we match or outperform state-of-the-art performance. Then, on a collection of student solutions to exam questions from an introductory university course, we show that our approach reaches an average precision of 88% on unseen questions, surpassing the 82% precision of teaching assistants. Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university. This is, to the best of our knowledge, the first successful deployment of a machine learning based feedback to open-ended student code.

\n", "tags": ["Transformer", "education"], "tsne_embedding": [-13.483322143554688, 17.49483871459961]}, {"key": "xia2023universal", "year": "2023", "title": "Universal Fuzzing via Large Language Models", "abstract": "

Fuzzing has achieved tremendous success in discovering bugs and vulnerabilities in various software systems. Systems under test (SUTs) that take in programming or formal language as inputs, e.g., compilers, runtime engines, constraint solvers, and software libraries with accessible APIs, are especially important as they are fundamental building blocks of software development. However, existing fuzzers for such systems often target a specific language, and thus cannot be easily applied to other languages or even other versions of the same language. Moreover, the inputs generated by existing fuzzers are often limited to specific features of the input language, and thus can hardly reveal bugs related to other or new features. This paper presents Fuzz4All, the first fuzzer that is universal in the sense that it can target many different input languages and many different features of these languages. The key idea behind Fuzz4All is to leverage large language models (LLMs) as an input generation and mutation engine, which enables the approach to produce diverse and realistic inputs for any practically relevant language. To realize this potential, we present a novel autoprompting technique, which creates LLM prompts that are wellsuited for fuzzing, and a novel LLM-powered fuzzing loop, which iteratively updates the prompt to create new fuzzing inputs. We evaluate Fuzz4All on nine systems under test that take in six different languages (C, C++, Go, SMT2, Java and Python) as inputs. The evaluation shows, across all six languages, that universal fuzzing achieves higher coverage than existing, language-specific fuzzers. Furthermore, Fuzz4All has identified 76 bugs in widely used systems, such as GCC, Clang, Z3, CVC5, OpenJDK, and the Qiskit quantum computing platform, with 47 bugs already confirmed by developers as previously unknown.

\n", "tags": ["fuzzing"], "tsne_embedding": [17.827844619750977, 12.710286140441895]}, {"key": "xu2019commit", "year": "2019", "title": "Commit Message Generation for Source Code Changes", "abstract": "

Commit messages, which summarize the source\ncode changes in natural language, are essential for\nprogram comprehension and software evolution understanding. Unfortunately, due to the lack of direct\nmotivation, commit messages are sometimes neglected by developers, making it necessary to\nautomatically generate such messages. State-of-the-art adopts learning based approaches such as\nneural machine translation models for the commitmessage generation problem. However, they tend\nto ignore the code structure information and suffer from the out-of-vocabulary issue.\nIn this paper, we propose CODISUM to address the above two limitations. In particular,\nwe first extract both code structure and code semantics from the source code changes, and then\njointly model these two sources of information so as to better learn the representations\n of the code changes. Moreover, we augment the model with copying mechanism to further\nmitigate the out-of-vocabulary issue. Experimental evaluations on real data demonstrate that\nthe proposed approach significantly outperforms the state-of-the-art in terms of accurately generating the commit messages.

\n", "tags": ["edit", "summarization"], "tsne_embedding": [-15.445657730102539, 2.5920045375823975]}, {"key": "xu2019method", "year": "2019", "title": "Method name suggestion with hierarchical attention networks", "abstract": "

Method Rename has been a widely used refactoring operation that improves program comprehension and maintenance. Descriptive method names that summarize functionalities of source code can facilitate program comprehension. Much research has been done to suggest method names through source code summarization. However, unlike natural language, a code snippet consists of basic blocks organized by complicated structures. In this work, we observe a hierarchical structure \u2014 tokens form basic blocks and basic blocks form a code snippet. Based on this observation, we exploit a hierarchical attention network to learn the representation of methods. Specifically, we apply two-level attention mechanism to learn the importance of each token in a basic block and that of a basic block in a method respectively. We evaluated our approach on 10 open source repositories and compared it against three state-of-the-art approaches. The results on these open-source data show the superiority of our hierarchical attention networks in terms of effectiveness.

\n", "tags": ["naming"], "tsne_embedding": [-18.125049591064453, -9.121139526367188]}, {"key": "xu2020incorporating", "year": "2020", "title": "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation", "abstract": "

Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at [Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at https://github.com/neulab/external-knowledge-codegen.

\n", "tags": ["bimodal", "code generation"], "tsne_embedding": [-5.78445291519165, -9.634625434875488]}, {"key": "xu2021capturing", "year": "2021", "title": "Capturing Structural Locality in Non-parametric Language Models", "abstract": "

Structural locality is a ubiquitous feature of real-world datasets, wherein data points are organized into local hierarchies. Some examples include topical clusters in text or project hierarchies in source code repositories. In this paper, we explore utilizing this structural locality within non-parametric language models, which generate sequences that reference retrieved examples from an external source. We propose a simple yet effective approach for adding locality information into such models by adding learned parameters that improve the likelihood of retrieving examples from local neighborhoods. Experiments on two different domains, Java source code and Wikipedia text, demonstrate that locality features improve model efficacy over models without access to these features, with interesting differences. We also perform an analysis of how and where locality features contribute to improved performance and why the traditionally used contextual similarity metrics alone are not enough to grasp the locality structure.

\n", "tags": ["language model"], "tsne_embedding": [-8.022143363952637, -21.435142517089844]}, {"key": "xu2022systematic", "year": "2022", "title": "A Systematic Evaluation of Large Language Models of Code", "abstract": "

Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, although targeted mainly for natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available at this https URL, which enables future research and application in this area.

\n", "tags": ["Transformer", "language model"], "tsne_embedding": [-0.24116714298725128, 2.91437029838562]}, {"key": "yadavally2023partial", "year": "2023", "title": "(Partial) Program Dependence Learning", "abstract": "

Code fragments from developer forums often migrate to applications due to the code reuse practice. Owing to the incomplete nature of such programs, analyzing them to early determine the presence of potential vulnerabilities is challenging. In this work, we introduce NeuralPDA, a neural network-based program dependence analysis tool for both complete and partial programs. Our tool efficiently incorporates intra-statement and inter-statement contextual features into statement representations, thereby modeling program dependence analysis as a statement-pair dependence decoding task. In the empirical evaluation, we report that NeuralPDA predicts the CFG and PDG edges in complete Java and C/C++ code with combined F-scores of 94.29% and 92.46%, respectively. The F-score values for partial Java and C/C++ code range from 94.29%\u201397.17% and 92.46%\u201396.01%, respectively. We also test the usefulness of the PDGs predicted by NEURALPDA (i.e., PDG) on the downstream task of method-level vulnerability detection. We discover that the performance of the vulnerability detection tool utilizing PDG is only 1.1% less than that utilizing the PDGs generated by a program analysis tool. We also report the detection of 14 real-world vulnerable code snippets from StackOverflow by a machine learning-based vulnerability detection tool that employs the PDGs predicted by NeuralPDA for these code snippets.

\n", "tags": ["large language models", "program analysis", "static analysis", "tool"], "tsne_embedding": [12.073339462280273, 11.827444076538086]}, {"key": "yadavally2024learning", "year": "2024", "title": "A Learning-Based Approach to Static Program Slicing", "abstract": "

Traditional program slicing techniques are crucial for early bug detection and manual/automated debugging of online code snippets. Nevertheless, their inability to handle incomplete code hinders their real-world applicability in such scenarios. To overcome these challenges, we present NS-Slicer, a novel learning-based approach that predicts static program slices for both complete and partial code. Our tool leverages a pre-trained language model to exploit its understanding of fine-grained variable-statement dependencies within source code. With this knowledge, given a variable at a specific location and a statement in a code snippet, NS-Slicer determines whether the statement belongs to the backward slice or forward slice, respectively. We conducted a series of experiments to evaluate NS-Slicer\u2019s performance. On complete code, it predicts the backward and forward slices with an F1-score of 97.41% and 95.82%, respectively, while achieving an overall F1-score of 96.77%. Notably, in 85.20% of the cases, the static program slices predicted by NS-Slicer exactly match entire slices from the oracle. For partial programs, it achieved an F1-score of 96.77%\u201397.49% for backward slicing, 92.14%\u201395.40% for forward slicing, and an overall F1-score of 94.66%\u201396.62%. Furthermore, we demonstrate NS-Slicer\u2019s utility in vulnerability detection (VD), integrating its predicted slices into an automated VD tool. In this setup, the tool detected vulnerabilities in Java code with a high F1-score of 73.38%. We also include the analyses studying NS-Slicer\u2019s promising performance and limitations, providing insights into its understanding of intrinsic code properties such as variable aliasing, leading to better slicing.

\n", "tags": ["large language models", "program analysis", "static", "tool"], "tsne_embedding": [12.164413452148438, 10.471586227416992]}, {"key": "yadavally2024predictive", "year": "2024", "title": "Predictive Program Slicing via Execution Knowledge-Guided Dynamic Dependence Learning", "abstract": "

Program slicing, the process of extracting program statements that influence values at a designated location (known as the slicing criterion), is helpful in both manual and automated debugging. However, such slicing techniques prove ineffective in scenarios where executing specific inputs is prohibitively expensive, or even impossible, as with partial code. In this paper, we introduce ND-Slicer, a predictive slicing methodology that caters to specific executions based on a particular input, overcoming the need for actual execution. We enable such a process by leveraging execution-aware pre-training to learn the dynamic program dependencies, including both dynamic data and control dependencies between variables in the slicing criterion and the remaining program statements. Such knowledge forms the cornerstone for constructing a predictive backward slice. Our empirical evaluation revealed a high accuracy in predicting program slices, achieving an exact-match accuracy of 81.3% and a ROUGE-LCS F1-score of 95.4% on Python programs. As an extrinsic evaluation, we illustrate ND-Slicer\u2019s usefulness in crash detection, with it locating faults with an accuracy of 63.9%. Furthermore, we include an in-depth qualitative evaluation, assessing ND-Slicer\u2019s understanding of branched structures such as if-else blocks and loops, as well as the control flow in inter-procedural calls.

\n", "tags": ["large language models", "program analysis", "dynamic", "tool"], "tsne_embedding": [11.847169876098633, 10.275728225708008]}, {"key": "yadid2016extracting", "year": "2016", "title": "Extracting Code from Programming Tutorial Videos", "abstract": "

The number of programming tutorial videos on the web\nincreases daily. Video hosting sites such as YouTube host\nmillions of video lectures, with many programming tutorials for various languages and platforms. These videos contain a wealth of valuable information, including code that\nmay be of interest. However, two main challenges have so\nfar prevented the effective indexing of programming tutorial\nvideos: (i) code in tutorials is typically written on-the-fly,\nwith only parts of the code visible in each frame, and (ii) optical character recognition (OCR) is not precise enough to\nproduce quality results from videos.

\n\n

We present a novel approach for extracting code from\nvideos that is based on: (i) consolidating code across frames,\nand (ii) statistical language models for applying corrections\nat different levels, allowing us to make corrections by choosing the most likely token, combination of tokens that form a\nlikely line structure, and combination of lines that lead to\na likely code fragment in a particular language. We implemented our approach in a tool called ACE , and used it to extract code from 40 Android video tutorials on YouTube . Our\nevaluation shows that ACE extracts code with high accuracy,\nenabling deep indexing of video tutorials.

\n", "tags": ["information extraction"], "tsne_embedding": [-4.508297920227051, 19.62556266784668]}, {"key": "yan2020are", "year": "2020", "title": "Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries", "abstract": "

Code search methods, especially those that allow programmers to raise queries in a natural language, plays an important role in software development. It helps to improve programmers\u2019 productivity by returning sample code snippets from the Internet and/or source-code repositories for their natural-language queries. Meanwhile, there are many code search methods in the literature that support natural-language queries. Difficulties exist in recognizing the strengths and weaknesses of each method and choosing the right one for different usage scenarios, because (1) the implementations of those methods and the datasets for evaluating them are usually not publicly available, and (2) some methods leverage different training datasets or auxiliary data sources and thus their effectiveness cannot be fairly measured and may be negatively affected in practical uses. To build a common ground for measuring code search methods, this paper builds CosBench, a dataset that consists of 1000 projects, 52 code-independent natural-language queries with ground truths, and a set of scripts for calculating four metrics on code research results. We have evaluated four IR (Information Retrieval)-based and two DL (Deep Learning)-based code search methods on CosBench. The empirical evaluation results clearly show the usefulness of the CosBench dataset and various strengths of each code search method. We found that DL-based methods are more suitable for queries on reusing code, and IR-based ones for queries on resolving bugs and learning API uses.

\n", "tags": ["search"], "tsne_embedding": [-4.312857627868652, -15.686369895935059]}, {"key": "yang2017language", "year": "2017", "title": "A Language Model for Statements of Software Code", "abstract": "

Building language models for source code enables a large set of improvements on traditional software engineering tasks. One promising application is automatic code completion. State-of-the-art techniques capture code regularities at token level with lexical information. Such language models are more suitable for predicting short token sequences, but become less effective with respect to long statement level predictions. In this paper, we have proposed PCC to optimize the token level based language modeling. Specifically, PCC introduced an intermediate representation (IR) for source code, which puts tokens into groups using lexeme and variable relative order. In this way, PCC is able to handle long token sequences, i.e., group sequences, to suggest a complete statement with the precise synthesizer. Further more, PCC employed a fuzzy matching technique which combined genetic and longest common sub-sequence algorithms to make the prediction more accurate. We have implemented a code completion plugin for Eclipse and evaluated it on open-source Java projects. The results have demonstrated the potential of PCC in generating precise long statement level predictions. In 30%-60% of the cases, it can correctly suggest the complete statement with only six candidates, and 40%-90% of the cases with ten candidates.

\n", "tags": ["language model"], "tsne_embedding": [-11.593055725097656, -17.0959415435791]}, {"key": "yang2020survey", "year": "2020", "title": "A Survey on Deep Learning for Software Engineering", "abstract": "

In 2006, Geoffrey Hinton proposed the concept of training \u2018\u2018Deep Neural Networks (DNNs)\u2019\u2019 and an improved model training method to break the bottleneck of neural network development. More recently, the introduction of AlphaGo in 2016 demonstrated the powerful learning ability of deep learning and its enormous potential. Deep learning has been increasingly used to develop state-of-the-art software engineering (SE) research tools due to its ability to boost performance for various SE tasks. There are many factors, e.g., deep learning model selection, internal structure differences, and model optimization techniques, that may have an impact on the performance of DNNs applied in SE. Few works to date focus on summarizing, classifying, and analyzing the application of deep learning techniques in SE. To fill this gap, we performed a survey to analyse the relevant studies published since 2006. We first provide an example to illustrate how deep learning techniques are used in SE. We then summarize and classify different deep learning techniques used in SE. We analyzed key optimization technologies used in these deep learning models, and finally describe a range of key research topics using DNNs in SE. Based on our findings, we present a set of current challenges remaining to be investigated and outline a proposed research road map highlighting key opportunities for future work.

\n", "tags": ["survey"], "tsne_embedding": [4.128140449523926, 22.0460205078125]}, {"key": "yao2018staqc", "year": "2018", "title": "StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow", "abstract": "

Stack Overflow (SO) has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including code retrieval and annotation. In most existing research, question-code pairs were collected heuristically and tend to have low quality. In this paper, we investigate a new problem of systematically mining question-code pairs from Stack Overflow (in contrast to heuristically collecting them). It is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-View Hierarchical Neural Network which can capture both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets in Python and SQL domain, our framework substantially outperforms heuristic methods with at least 15% higher F1 and accuracy. Furthermore, we present StaQC (Stack Overflow Question-Code pairs), the largest dataset to date of \u223c148K Python and \u223c120K SQL question-code pairs, automatically mined from SO using our framework. Under various case studies, we demonstrate that StaQC can greatly help develop data-hungry models for associating natural language with programming language.

\n", "tags": ["dataset"], "tsne_embedding": [-3.670806646347046, -11.225981712341309]}, {"key": "yao2019coacor", "year": "2019", "title": "CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning", "abstract": "

To accelerate software development, much research has been performed\nto help people understand and reuse the huge amount of available code\nresources. Two important tasks have been widely studied: code retrieval,\nwhich aims to retrieve code snippets relevant to a given natural language\nquery from a code base, and code annotation, where the goal is to annotate a \ncode snippet with anatural language description. Despite their advancement in recent\nyears, the two tasks are mostly explored separately. In this work, we\ninvestigate a novel perspective of Code annotation for Code retrieval \n(hence called \u201cCoaCor\u201d), where a code annotation model is trained\nto generate a natural language annotation that can represent the\nsemantic meaning of a given code snippet and can be leveraged by\na code retrieval model to better distinguish relevant code snippets\nfrom others. To this end, we propose an effective framework based\non reinforcement learning, which explicitly encourages the code\nannotation model to generate annotations that can be used for the\nretrieval task. Through extensive experiments, we show that code\nannotations generated by our framework are much more detailed\nand more useful for code retrieval, and they can further improve\nthe performance of existing code retrieval models significantly.

\n", "tags": ["search"], "tsne_embedding": [-5.740281581878662, -13.76944351196289]}, {"key": "yasunaga2020graph", "year": "2020", "title": "Graph-based, Self-Supervised Program Repair from Diagnostic Feedback", "abstract": "

We consider the problem of learning to repair programs from diagnostic feedback (e.g., compiler error messages). Program repair is challenging for two reasons: First, it requires reasoning and tracking symbols across source code and diagnostic feedback. Second, labeled datasets available for program repair are relatively small. In this work, we propose novel solutions to these two challenges. First, we introduce a program-feedback graph, which connects symbols relevant to program repair in source code and diagnostic feedback, and then apply a graph neural network on top to model the reasoning process. Second, we present a self-supervised learning paradigm for program repair that leverages unlabeled programs available online to create a large amount of extra program repair examples, which we use to pre-train our models. We evaluate our proposed approach on two applications: correcting introductory programming assignments (DeepFix dataset) and correcting the outputs of program synthesis (SPoC dataset). Our final system, DrRepair, significantly outperforms prior work, achieving 66.1% full repair rate on DeepFix (+20.8% over the prior best), and 48.0% synthesis success rate on SPoC (+3.3% over the prior best).

\n", "tags": ["repair", "edit", "GNN"], "tsne_embedding": [23.776636123657227, 1.278108835220337]}, {"key": "ye2020leveraging", "year": "2020", "title": "Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning", "abstract": "

Code summarization generates brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query. Since both tasks aim to model the association between natural language and programming language, recent studies have combined these two tasks to improve their performance. However, researchers have yet been able to effectively leverage the intrinsic connection between the two tasks as they train these tasks in a separate or pipeline manner, which means their performance can not be well balanced. In this paper, we propose a novel end-to-end model for the two tasks by introducing an additional code generation task. More specifically, we explicitly exploit the probabilistic correlation between code summarization and code generation with dual learning, and utilize the two encoders for code summarization and code generation to train the code retrieval task via multi-task learning. We have carried out extensive experiments on an existing dataset of SQL and Python, and results show that our model can significantly improve the results of the code retrieval task over the-state-of-art models, as well as achieve competitive performance in terms of BLEU score for the code summarization task.

\n", "tags": ["search", "summarization"], "tsne_embedding": [-12.388094902038574, -9.483141899108887]}, {"key": "ye2020misim", "year": "2020", "title": "MISIM: An End-to-End Neural Code Similarity System", "abstract": "

Code similarity systems are integral to a range of applications from code recommendation to automated construction of software tests and defect mitigation. In this paper, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters. We compare MISIM to three other state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 45,780 programs, MISIM consistently outperformed all three systems, often by a large factor (upwards of 40.6x).

\n", "tags": ["code similarity"], "tsne_embedding": [1.0811818838119507, -12.453810691833496]}, {"key": "ye2021neural", "year": "2021", "title": "Neural Program Repair with Execution-based Backpropagation", "abstract": "

Neural machine translation (NMT) architectures have achieved promising results for automatic program repair. Yet, they have the limitation of generating low-quality patches (e.g., not compilable patches). This is because the existing works only optimize a purely syntactic loss function based on characters and tokens without incorporating program-specific information during neural net weight optimization. In this paper, we propose a novel program repair model called RewardRepair. The core novelty of RewardRepair is to improve NMT-based program repair with a loss function based on program compilation and test execution information, rewarding the network to produce patches that compile and that do not overfit. We conduct several experiments to evaluate RewardRepair showing that it is feasible and effective to use compilation and test execution results to optimize the underlying neural repair model. In total, RewardRepair correctly repairs 43 Defects4J bugs including eight that are fixed for the first time.

\n", "tags": ["repair"], "tsne_embedding": [21.299501419067383, -0.2764773666858673]}, {"key": "ye2022selfapr", "year": "2022", "title": "SelfAPR: Self-supervised Program Repair with Test Execution Diagnostics", "abstract": "

Neural program repair has achieved good results in a recent series of papers. Yet, we observe that the related work fails to repair some bugs because of a lack of knowledge about 1) the program being repaired, and 2) the actual fault being repaired. In this paper, we solve both problems by changing the learning paradigm from supervised training to self-supervised training in an approach called SelfAPR. First, SelfAPR generates and constructs training samples by perturbing a previous version of the program being repaired, enforcing the neural model to capture project-specific knowledge. This is different from all the existing work based on past commits. Second, SelfAPR extracts and encodes test execution diagnostics into the input representation, steering the neural model to fix the specific kind of fault. This is different from the existing studies that only consider static source code in the input. We implement SelfAPR and evaluate it in a systematic manner. We train SelfAPR with 253 411 training samples obtained by perturbing 17 open-source projects. We evaluate SelfAPR on 818 bugs from Defects4J, SelfAPR correctly repairs 112 of them.

\n", "tags": ["repair", "execution"], "tsne_embedding": [23.68659210205078, 1.8757392168045044]}, {"key": "yefet2019adversarial", "year": "2019", "title": "Adversarial Examples for Models of Code", "abstract": "

Neural models of code have shown impressive performance for tasks such as predicting method names and identifying certain kinds of bugs. In this paper, we show that these models are vulnerable to adversarial examples, and introduce a novel approach for attacking trained models of code with adversarial examples. The main idea is to force a given trained model to make an incorrect prediction as specified by the adversary by introducing small perturbations that do not change the program\u2019s semantics. To find such perturbations, we present a new technique for Discrete Adversarial Manipulation of Programs (DAMP). DAMP works by deriving the desired prediction with respect to the model\u2019s inputs while holding the model weights constant and following the gradients to slightly modify the code.

\n\n

To defend a model against such attacks, we propose placing a defensive model (Anti-DAMP) in front of it. Anti-DAMP detects unlikely mutations and masks them before feeding the input to the downstream model.

\n\n

We show that our DAMP attack is effective across three neural architectures: code2vec, GGNN, and GNN-FiLM, in both Java and C#. We show that DAMP has up to 89% success rate in changing a prediction to the adversary\u2019s choice (\u201ctargeted attack\u201d), and a success rate of up to 94% in changing a given prediction to any incorrect prediction (\u201cnon-targeted attack\u201d). By using Anti-DAMP, the success rate of the attack drops drastically for both targeted and non-targeted attacks, with a minor penalty of 2% relative degradation in accuracy while not performing under attack.

\n", "tags": ["adversarial"], "tsne_embedding": [11.136536598205566, 21.312532424926758]}, {"key": "yin2017syntactic", "year": "2017", "title": "A Syntactic Neural Model for General-Purpose Code Generation", "abstract": "

We consider the problem of parsing natural language descriptions into source code\nwritten in a general-purpose programming\nlanguage like Python. Existing data-driven methods treat this problem as a language generation task without considering\nthe underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture\npowered by a grammar model to explicitly\ncapture the target syntax as prior knowledge. Experiments find this an effective\nway to scale up to generation of complex\nprograms from natural language descriptions, achieving state-of-the-art results that\nwell outperform previous code generation\nand semantic parsing approaches.

\n", "tags": ["code generation", "grammar", "bimodal"], "tsne_embedding": [-21.938304901123047, -3.5005171298980713]}, {"key": "yin2018mining", "year": "2018", "title": "Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow", "abstract": "

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models require parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.

\n\n", "tags": ["dataset"], "tsne_embedding": [-3.962278127670288, -10.88506031036377]}, {"key": "yin2019learning", "year": "2019", "title": "Learning to Represent Edits", "abstract": "

We introduce the problem of learning distributed representations of edits. By combining a\n\u201cneural editor\u201d with an \u201cedit encoder\u201d, our models learn to represent the salient\ninformation of an edit and can be used to apply edits to new inputs.\nWe experiment on natural language and source code edit data. Our evaluation yields\npromising results that suggest that our neural network models learn to capture\nthe structure and semantics of edits. We hope that this interesting task and\ndata source will inspire other researchers to work further on this problem.

\n", "tags": ["edit"], "tsne_embedding": [-12.10683536529541, 0.8281735777854919]}, {"key": "yin2022natural", "year": "2022", "title": "Natural Language to Code Generation in Interactive Data Science Notebooks", "abstract": "

Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.

\n", "tags": ["notebook", "evaluation"], "tsne_embedding": [3.162888765335083, -0.5272204875946045]}, {"key": "yonai2019mercem", "year": "2019", "title": "Mercem: Method Name Recommendation Based on Call Graph Embedding", "abstract": "

Comprehensibility of source code is strongly affected by identifier names, therefore software developers need to give good (e.g. meaningful but short) names to identifiers. On the other hand, giving a good name is sometimes a difficult and time-consuming task even for experienced developers. To support naming identifiers, several techniques for recommending identifier name candidates have been proposed. These techniques, however, still have challenges on the goodness of suggested candidates and limitations on applicable situations. This paper proposes a new approach to recommending method names by applying graph embedding techniques to the method call graph. The evaluation experiment confirms that the proposed technique can suggest more appropriate method name candidates in difficult situations than the state of the art approach.

\n", "tags": ["naming", "representation", "refactoring"], "tsne_embedding": [12.822196006774902, -9.811189651489258]}, {"key": "yuan2017abridging", "year": "2017", "title": "Abridging Source Code", "abstract": "

In this paper, we consider the problem of source code abridgment, where the goal is to remove statements from a source code in order to display the source code in a small space, while at the same time leaving the ``important\u2019\u2019 parts of the source code intact, so that an engineer can read the code and quickly understand purpose of the code. To this end, we develop an algorithm that looks at a number of examples, human-created source code abridgments, and learns how to remove lines from the code in order to mimic the human abridger. The learning algorithm takes into account syntactic features of the code, as well as semantic features such as control flow and data dependencies. Through a comprehensive user study, we show that the abridgments that our system produces can decrease the time that a user must look at code in order to understand its functionality, as well as increase the accuracy of the assessment, while displaying the code in a greatly reduced area.

\n", "tags": ["summarization"], "tsne_embedding": [-19.59737777709961, -10.015637397766113]}, {"key": "zaremba2014learning", "year": "2014", "title": "Learning to Execute", "abstract": "

Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are widely used because they are expressive and are easy to train. Our interest lies in empirically evaluating the expressiveness and the learnability of LSTMs in the sequence-to-sequence regime by training them to evaluate short computer programs, a domain that has traditionally been seen as too complex for neural networks. We consider a simple class of programs that can be evaluated with a single left-to-right pass using constant memory. Our main result is that LSTMs can learn to map the character-level representations of such programs to their correct outputs. Notably, it was necessary to use curriculum learning, and while conventional curriculum learning proved ineffective, we developed a new variant of curriculum learning that improved our networks\u2019 performance in all experimental conditions. The improved curriculum had a dramatic impact on an addition problem, making it possible to train an LSTM to add two 9-digit numbers with 99% accuracy.

\n", "tags": ["execution", "representation"], "tsne_embedding": [-23.83325958251953, 2.888960599899292]}, {"key": "zeng2022extensive", "year": "2022", "title": "An Extensive Study on Pre-trained Models for Program Understanding and Generation", "abstract": "

Automatic program understanding and generation techniques could\nsignificantly advance the productivity of programmers and have\nbeen widely studied by academia and industry. Recently, the advent of pre-trained paradigm enlightens researchers to develop\ngeneral-purpose pre-trained models which can be applied for a\nbroad range of program understanding and generation tasks. Such\npre-trained models, derived by self-supervised objectives on large\nunlabelled corpora, can be fine-tuned in downstream tasks (such\nas code search and code generation) with minimal adaptations. Although these pre-trained models claim superiority over the prior\ntechniques, they seldom follow equivalent evaluation protocols, e.g.,\nthey are hardly evaluated on the identical benchmarks, tasks, or settings. Consequently, there is a pressing need for a comprehensive\nstudy of the pre-trained models on their effectiveness, versatility\nas well as the limitations to provide implications and guidance for\nthe future development in this area. To this end, we first perform\nan extensive study of eight open-access pre-trained models over\na large benchmark on seven representative code tasks to assess\ntheir reproducibility. We further compare the pre-trained models\nand domain-specific state-of-the-art techniques for validating pre-trained effectiveness. At last, we investigate the robustness of the\npre-trained models by inspecting their performance variations under adversarial attacks. Through the study, we find that while we\ncan in general replicate the original performance of the pre-train\nmodels on their evaluated tasks and adopted benchmarks, subtle\nperformance fluctuations can refute the findings in their original\npapers. Moreover, none of the existing pre-trained models can dominate over all other models. We also find that the pre-trained models\ncan significantly outperform non-pre-trained state-of-the-art techniques in program understanding tasks. Furthermore, we perform\nthe first study for natural language-programming language pre-trained model robustness via adversarial attacks and find that a\nsimple random attack approach can easily fool the state-of-the-art\npre-trained models and thus incur security issues. At last, we also\nprovide multiple practical guidelines for advancing future research\non pre-trained models for program understanding and generation.

\n", "tags": ["Transformer", "evaluation"], "tsne_embedding": [3.874532461166382, 1.903812289237976]}, {"key": "zhang2019learning", "year": "2019", "title": "Learning Uniform Semantic Features for Natural Language and Programming Language Globally, Locally and Sequentially", "abstract": "

Semantic feature learning for natural language and programming language is a preliminary step in addressing many software mining tasks. Many existing methods leverage\ninformation in lexicon and syntax to learn features for textual data.\nHowever, such information is inadequate to represent the entire semantics in either text sentence or code snippet. This\nmotivates us to propose a new approach to learn semantic\nfeatures for both languages, through extracting three levels of\ninformation, namely global, local and sequential information,\nfrom textual data. For tasks involving both modalities, we\nproject the data of both types into a uniform feature space so\nthat the complementary knowledge in between can be utilized\nin their representation. In this paper, we build a novel and\ngeneral-purpose feature learning framework called UniEmbed, to uniformly learn comprehensive semantic representation for both natural language and programming language.\nExperimental results on three real-world software mining\ntasks show that UniEmbed outperforms state-of-the-art models in feature learning and prove the capacity and effectiveness of our model.

\n", "tags": ["representation", "bimodal"], "tsne_embedding": [-3.3959097862243652, -9.633265495300293]}, {"key": "zhang2019novel", "year": "2019", "title": "A Novel Neural Source Code Representation based on Abstract Syntax Tree", "abstract": "

Exploiting machine learning techniques for analyzing programs has attracted much attention. One key problem is how to represent code fragments well for follow-up analysis. Traditional information retrieval based methods often treat programs as natural language texts, which could miss important semantic information of source code. Recently, state-of-the-art studies demonstrate that abstract syntax tree (AST) based neural models can better represent source code. However, the sizes of ASTs are usually large and the existing models are prone to the long-term dependency problem. In this paper, we propose a novel AST-based Neural Network (ASTNN) for source code representation. Unlike existing models that work on entire ASTs, ASTNN splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements. Based on the sequence of statement vectors, a bidirectional RNN model is used to leverage the naturalness of statements and finally produce the vector representation of a code fragment. We have applied our neural network based source code representation method to two common program comprehension tasks: source code classification and code clone detection. Experimental results on the two tasks indicate that our model is superior to state-of-the-art approaches.

\n", "tags": ["representation", "grammar"], "tsne_embedding": [-0.30873483419418335, -8.945197105407715]}, {"key": "zhang2020generating", "year": "2020", "title": "Generating Adversarial Examples for Holding Robustness of Source Code Processing Models", "abstract": "

Automated processing, analysis, and generation of source code are among the key activities\nin software and system life-cycle. To this end, while deep learning (DL) exhibits a certain level\nof capability in handling these tasks, the current state-of-the-art DL models still suffer from\nnon-robust issues and can be easily fooled by adversarial attacks.

\n\n

Different from adversarial \nattacks for image, audio, andnatural languages, the structured nature of programming\nlanguages brings new challenges. In this paper, we propose a Metropolis-Hastings\nsampling-based identifier renaming technique, named Metropolis-Hastings Modifier (MHM),\nwhich generates adversarial examples for DL models specialized for source code processing.\nOur in-depth evaluation on a functionality classification benchmark demonstrates the\neffectiveness of MHM in generating adversarial examples of source code. The higher robustness\nand performance enhanced through our adversarial training with MHM further confirms the usefulness\nof DL models-based method for future fully automated source code processing.

\n", "tags": ["adversarial"], "tsne_embedding": [9.607763290405273, 21.818538665771484]}, {"key": "zhang2021bag", "year": "2021", "title": "Bag-of-Words Baselines for Semantic Code Search", "abstract": "

The task of semantic code search is to retrieve code snippets from a source code corpus based on an information need expressed in natural language. The semantic gap between natural language and programming languages has for long been regarded as one of the most significant obstacles to the effectiveness of keyword-based information retrieval (IR) methods. It is a common assumption that \u201ctraditional\u201d bag-of-words IR methods are poorly suited for semantic code search: our work empirically investigates this assumption. Specifically, we examine the effectiveness of two traditional IR methods, namely BM25 and RM3, on the CodeSearchNet Corpus, which consists of natural language queries paired with relevant code snippets. We find that the two keyword-based methods outperform several pre-BERT neural models. We also compare several code-specific data pre-processing strategies and find that specialized tokenization improves effectiveness.

\n", "tags": ["search"], "tsne_embedding": [-5.324240207672119, -15.57623291015625]}, {"key": "zhang2021disentangled.md", "year": "2021", "title": "Disentangled Code Representation Learning for Multiple Programming Languages", "abstract": "

Developing effective distributed representations of source code is fundamental yet challenging for many software engineering tasks such as code clone detection, code search, code translation and transformation. However, current code embedding approaches that represent the semantic and syntax of code in a mixed way are less interpretable and the resulting embedding can not be easily generalized across programming languages. In this paper, we propose a disentangled code representation learning approach to separate the semantic from the syntax of source code under a multi-programming-language setting, obtaining better interpretability and generalizability. Specially, we design three losses dedicated to the characteristics of source code to enforce the disentanglement effectively. We conduct comprehensive experiments on a real-world dataset composed of programming exercises implemented by multiple solutions that are semantically identical but grammatically distinguished. The experimental results validate the superiority of our proposed disentangled code representation, compared to several baselines, across three types of downstream tasks, i.e., code clone detection, code translation, and code-to-code search.

\n", "tags": ["representation"], "tsne_embedding": [2.133340835571289, -10.133064270019531]}, {"key": "zhang2022coditt5", "year": "2022", "title": "CoditT5: Pretraining for Source Code and Natural Language Editing", "abstract": "

Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming pure generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a pure generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks.

\n", "tags": ["Transformer", "edit"], "tsne_embedding": [-13.638455390930176, 0.19851242005825043]}, {"key": "zhang2023repocoder", "year": "2023", "title": "RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation", "abstract": "

The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model, which allows for the effective utilization of repository-level information for code completion and grants the ability to generate code at various levels of granularity. Furthermore, RepoCoder utilizes a novel iterative retrieval-generation paradigm that bridges the gap between retrieval context and the intended completion target. We also propose a new benchmark RepoEval, which consists of the latest and high-quality real-world repositories covering line, API invocation, and function body completion scenarios. We test the performance of RepoCoder by using various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research.

\n", "tags": ["completion", "Transformer", "retrieval"], "tsne_embedding": [-8.205622673034668, -15.8959321975708]}, {"key": "zhao2018neural", "year": "2018", "title": "Neural-Augumented Static Analysis of Android Communication", "abstract": "

We address the problem of discovering communication links between applications in the popular Android mobile operating system, an important problem for security and privacy in Android. Any scalable static analysis in this complex setting is bound to produce an excessive amount of false-positives, rendering it impractical. To improve precision, we propose to augment static analysis with a trained neural-network model that estimates the probability that a communication link truly exists. We describe a neural-network architecture that encodes abstractions of communicating objects in two applications and estimates the probability with which a link indeed exists. At the heart of our architecture are type-directed encoders (TDE), a general framework for elegantly constructing encoders of a compound data type by recursively composing encoders for its constituent types. We evaluate our approach on a large corpus of Android applications, and demonstrate that it achieves very high accuracy. Further, we conduct thorough interpretability studies to understand the internals of the learned neural networks.

\n", "tags": ["program analysis"], "tsne_embedding": [24.369873046875, 15.105952262878418]}, {"key": "zhao2019neural", "year": "2019", "title": "Neural Networks for Modeling Source Code Edits", "abstract": "

Programming languages are emerging as a challenging and interesting domain for machine learning. A core task, which has received significant attention in recent years, is building generative models of source code. However, to our knowledge, previous generative models have always been framed in terms of generating static snapshots of code. In this work, we instead treat source code as a dynamic object and tackle the problem of modeling the edits that software developers make to source code files. This requires extracting intent from previous edits and leveraging it to generate subsequent edits. We develop several neural networks and use synthetic data to test their ability to learn challenging edit patterns that require strong generalization. We then collect and train our models on a large-scale dataset of Google source code, consisting of millions of fine-grained edits from thousands of Python developers. From the modeling perspective, our main conclusion is that a new composition of attentional and pointer network components provides the best overall performance and scalability. From the application perspective, our results provide preliminary evidence of the feasibility of developing tools that learn to predict future edits.

\n", "tags": ["edit"], "tsne_embedding": [-11.345458030700684, 1.0072425603866577]}, {"key": "zhong2018generating", "year": "2018", "title": "Generating Regular Expressions from Natural Language Specifications: Are We There Yet?", "abstract": "

Recent state-of-the-art approaches automatically generate\nregular expressions from natural language specifications.\nGiven that these approaches use only synthetic data in both\ntraining datasets and validation/test datasets, a natural question arises: are these approaches effective to address various\nreal-world situations? To explore this question, in this paper, we conduct a characteristic study on comparing two synthetic datasets used by the recent research and a real-world\ndataset collected from the Internet, and conduct an experimental study on applying a state-of-the-art approach on the\nreal-world dataset. Our study results suggest the existence of\ndistinct characteristics between the synthetic datasets and the\nreal-world dataset, and the state-of-the-art approach (based\non a model trained from a synthetic dataset) achieves extremely low effectiveness when evaluated on real-world data,\nmuch lower than the effectiveness when evaluated on the synthetic dataset. We also provide initial analysis on some of\nthose challenging cases and discuss future directions.

\n", "tags": ["bimodal", "code generation"], "tsne_embedding": [-18.520305633544922, -20.122739791870117]}, {"key": "zhong2020semantic", "year": "2020", "title": "Semantic Scaffolds for Pseudocode-to-Code Generation", "abstract": "

We propose a method for program generation based on semantic scaffolds, lightweight structures representing the high-level semantic and syntactic composition of a program. By first searching over plausible scaffolds then using these as constraints for a beam search over programs, we achieve better coverage of the search space when compared with existing techniques. We apply our hierarchical search method to the SPoC dataset for pseudocode-to-code generation, in which we are given line-level natural language pseudocode annotations and aim to produce a program satisfying execution-based test cases. By using semantic scaffolds during inference, we achieve a 10% absolute improvement in top-100 accuracy over the previous state-of-the-art. Additionally, we require only 11 candidates to reach the top-3000 performance of the previous best approach when tested against unseen problems, demonstrating a substantial improvement in efficiency.

\n", "tags": ["code generation", "synthesis"], "tsne_embedding": [-9.859814643859863, -9.016633033752441]}, {"key": "zhou2019devign", "year": "2020", "title": "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks", "abstract": "

Vulnerability identification is crucial to protect the software systems from attacks for cyber security. It is especially important to localize the vulnerable functions among the source code to facilitate the fix. However, it is a challenging and tedious process, and also requires specialized security expertise. Inspired by the work on manually-defined patterns of vulnerabilities from various code representation graphs and the recent advance on graph neural networks, we propose Devign, a general graph neural network based model for graph-level classification through learning on a rich set of code semantic representations. It includes a novel Conv module to efficiently extract useful features in the learned rich node representations for graph-level classification. The model is trained over manually labeled datasets built on 4 diversified large-scale open-source C projects that incorporate high complexity and variety of real source code instead of synthesis code used in previous works. The results of the extensive evaluation on the datasets demonstrate that Devign outperforms the state of the arts significantly with an average of 10.51% higher accuracy and 8.68% F1 score, increases averagely 4.66% accuracy and 6.37% F1 by the Conv module.

\n", "tags": ["GNN", "static analysis"], "tsne_embedding": [7.272311210632324, 17.830419540405273]}, {"key": "zhou2021improving", "year": "2021", "title": "Improving Code Autocompletion with Transfer Learning", "abstract": "

Software language models have achieved promising results predicting code completion usages, and several industry studies have described successful IDE integrations. Recently, accuracy in autocompletion prediction improved 12.8% from training on a real-world dataset collected from programmers\u2019 IDE activity. But what if limited examples of IDE autocompletion in the target programming language are available for model training? In this paper, we investigate the efficacy of pretraining autocompletion models on non-IDE, non-autocompletion, and different-language example code sequences. We find that these unsupervised pretrainings improve model accuracy by over 50% on very small fine-tuning datasets and over 10% on 50k labeled examples. We confirm the real-world impact of these pretrainings in an online setting through A/B testing on thousands of IDE autocompletion users, finding that pretraining is responsible for increases of up to 6.63% autocompletion usage.

\n", "tags": ["autocomplete", "Transformer"], "tsne_embedding": [-7.495175838470459, 7.743719577789307]}, {"key": "zhou2022codebertscore", "year": "2023", "title": "CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code", "abstract": "

Since the rise of neural models of code that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching as BLEU, CodeBERTScore computes a soft similarity score between each token in the generated code and in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score by CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score . Our language-specific models have been downloaded more than 25,000 times from the Huggingface Hub.

\n", "tags": ["evaluation", "Transformer"], "tsne_embedding": [-5.547420024871826, -5.937540531158447]}, {"key": "zhou2022docoder", "year": "2022", "title": "DocCoder: Generating Code by Retrieving and Reading Docs", "abstract": "

Natural-language-to-code models learn to generate a code snippet given a natural language (NL) intent. However, the rapid growth of both publicly available and proprietary libraries and functions makes it impossible to cover all APIs using training examples, as new libraries and functions are introduced daily. Thus, existing models inherently cannot generalize to using unseen functions and libraries merely through incorporating them into the training data. In contrast, when human programmers write programs, they frequently refer to textual resources such as code manuals, documentation, and tutorials, to explore and understand available library functionality. Inspired by this observation, we introduce DocCoder: an approach that explicitly leverages code manuals and documentation by (1) retrieving the relevant documentation given the NL intent, and (2) generating the code based on the NL intent and the retrieved documentation. Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model. We demonstrate that DocCoder consistently improves NL-to-code models: DocCoder achieves 11x higher exact match accuracy than strong baselines on a new Bash dataset tldr; on the popular Python CoNaLa benchmark, DocCoder improves over strong baselines by 1.65 BLEU.

\n", "tags": ["Transformer", "search", "code generation"], "tsne_embedding": [-6.816485404968262, -10.46729564666748]}, {"key": "zhu2020ocor", "year": "2020", "title": "OCoR: An Overlapping-Aware Code Retriever", "abstract": "

Code retrieval helps developers reuse the code snippet in the open-source projects. Given a natural language description, code retrieval aims to search for the most relevant code among a set of code. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. The overlaps between different names used by different people indicate that two different names may be potentially related (e.g., \u201cmessage\u201d and \u201cmsg\u201d), and the overlaps between identifiers in code and words in natural language descriptions indicate that the code snippet and the description may potentially be related. To address these problems, we propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps: the first embeds identifiers by character to capture the overlaps between identifiers, and the second introduces a novel overlap matrix to represent the degrees of overlaps between each natural language word and each identifier.\nThe evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches and achieves 13.1% to 22.3% improvements. Moreover, we also conducted several in-depth experiments to help understand the performance of different components in OCoR.

\n", "tags": ["search"], "tsne_embedding": [-2.2864298820495605, -13.117897033691406]}, {"key": "zhu2921syntax", "year": "2021", "title": "A Syntax-Guided Edit Decoder for Neural Program Repair", "abstract": "

Automated Program Repair (APR) helps improve the efficiency of software development and maintenance. Recent APR techniques use deep learning, particularly the encoder-decoder architecture, to generate patches.\nThough existing DL-based APR approaches have proposed different encoder architectures, the decoder remains to be the standard one, which generates a sequence of tokens one by one to replace the faulty statement.\nThis decoder has multiple limitations: 1) allowing to generate syntactically incorrect programs, 2) inefficiently representing small edits, and 3) not being able to generate project-specific identifiers.\nIn this paper, we propose Recoder, a syntax-guided edit decoder with placeholder generation. Recoder is novel in multiple aspects: 1) Recoder generates edits rather than modified code, allowing efficient representation of small edits; 2) Recoder is syntax-guided, with the novel provider/decider architecture to ensure the syntactic correctness of the patched program and accurate generation; 3) Recoder generates placeholders that could be instantiated as project-specific identifiers later.\nWe conduct experiments to evaluate Recoder on 395 bugs from Defects4J v1.2, 420 additional bugs from Defects4J v2.0, 297 bugs from IntroClassJava and 40 bugs from QuixBugs. Our results show that Recoder repairs 53 bugs on Defects4J v1.2, which achieves 26.2% (11 bugs) improvement over the previous state-of-the-art approach for single-hunk bugs (TBar). Importantly, to our knowledge, Recoder is the first DL-based APR approach that has outperformed the traditional APR approaches on this benchmark.

\n", "tags": ["edit"], "tsne_embedding": [19.371950149536133, 1.4317572116851807]}, {"key": "ziegler2022productivity", "year": "2022", "title": "Productivity Assessment of Neural Code Completion", "abstract": "

Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers\u2019 productivity, without being able to measure it directly. In this case study, we asked users of GitHub Copilot about its impact on their productivity, and sought to find a reflection of their perception in directly measurable user data. We find that the rate with which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers\u2019 perception of productivity.

\n", "tags": ["evaluation", "human evaluation"], "tsne_embedding": [-9.200102806091309, 7.861722946166992]}, {"key": "zlotchevski2022exploring", "year": "2022", "title": "Exploring and Evaluating Personalized Models for Code Generation", "abstract": "

Large Transformer models achieved the state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain \u2013 for example, question-answering on a given topic \u2013 generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model\u2019s parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.

\n", "tags": ["Transformer"], "tsne_embedding": [0.9705953001976013, -1.4199587106704712]}, {"key": "zugner2021language", "year": "2021", "title": "Language-Agnostic Representation Learning of Source Code from Structure and Context", "abstract": "

Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.

\n", "tags": ["Transformer", "representation"], "tsne_embedding": [-13.282480239868164, -7.119506359100342]}] \ No newline at end of file

+ + Machine Learning for Big Code and Naturalness + +

404: Page not found

- - {{ site.title }} - -

{{ page.title }}

{{ page.title }}

{{ page.authors }}. {{ page.conference | default: page.journal }} {{ page.year }}

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Contributing

Adding a publication

Adding a new categorization

Reusing the website structure

+ + Machine Learning for Big Code and Naturalness + +

Machine Learning on Source Code

🏷 Browse Papers by Tag

About This Site

Contributing

Contributors

Contributors to the website

+ + Machine Learning for Big Code and Naturalness + +

+ + Machine Learning for Big Code and Naturalness + +

Graph4Code: A Machine Interpretable Knowledge Graph for Code

Ibrahim Abdelaziz, Julian Dolby, James P. McCusker, Kavitha Srinivas. 2020

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation

Rajas Agashe, Srinivasan Iyer, Luke Zettlemoyer. 2019

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Using Machine Translation for Converting Python 2 to Python 3 Code

Karan Aggarwal, Mohammad Salameh, Abram Hindle. 2015

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context

Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, Sriram Rajamani. NeurIPS 2023

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

A Transformer-based Approach for Source Code Summarization

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. ACL 2020

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Unified Pre-training for Program Understanding and Generation

Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang. NAACL 2021

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Learning Lenient Parsing & Typing via Indirect Supervision

Toufique Ahmed, Vincent Hellendoorn, Premkumar Devanbu. 2019

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Learning code summarization from a small and local dataset

Toufique Ahmed, Premkumar Devanbu. 2022

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Studying LLM Performance on Closed- and Open-source Data

Toufique Ahmed, Christian Bird, Premkumar Devanbu, Saikat Chakraborty. 2024

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Improving Few-Shot Prompts with Relevant Static Analysis Products

Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, Earl T. Barr. 2023

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

A large-scale benchmark for few-shot program induction and synthesis

Ferran Alet, Javier Lopez-Contreras, James Koppel, Maxwell Nye, Armando Solar-Lezama, Tomas Lozano-Perez, Leslie Kaelbling, Joshua Tenenbaum. ICML 2021

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

SantaCoder: don’t reach for the stars!

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Mining Source Code Repositories at Massive Scale Using Language Modeling

Miltiadis Allamanis, Charles Sutton. MSR 2013

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Learning Natural Coding Conventions

Miltiadis Allamanis, Earl T. Barr, Christian Bird, Charles Sutton. FSE 2014

Similar Work

+ + Machine Learning for Big Code and Naturalness + +

Mining Idioms from Source Code

Miltiadis Allamanis, Charles Sutton. FSE 2014

Similar Work

+ + Machine Learning for Big Code and Naturalness + +