POPL 2021
Sun 17 - Fri 22 January 2021 Online

Paper artifacts are the software, mechanised proofs, test suites, and benchmarks that support the paper and evaluate its claims. To evaluate paper artifacts, POPL has run an artifact evaluation process since 2015, similar to other such efforts in our community.

Submit your Artifact

Artifact evaluation is an optional process that we highly encourage. We solicit artifacts from all authors of accepted papers. Artifacts can be software, mechanised proofs, test suites, benchmarks, or anything else that bolsters the claims of the paper, except paper proofs, which the AEC lacks the time and expertise to carefully review.

Register and submit your artifact at https://popl21ae.hotcrp.com by the following dates:

  • Paper notification: 1 October 2020 (AoE)
  • Artifact registration deadline: 5 October 2020 (AoE)
  • Artifact submission deadline: 8 October 2020 (AoE)
  • Artifact evaluation phase 1: 22 October 2020 (AoE)
  • Artifact evaluation phase 2: 2 November 2020 (AoE)
  • Camera-ready deadline: 4 November 2020 (AoE)

Acceptance Criteria

The AEC targets a 100% acceptance rate. Artifacts are evaluated against the criteria of:

  • Consistency with the claims of the paper and the results it presents
  • Completeness insofar as possible, supporting all evaluated claims of the paper
  • Documentation to allow easy reproduction by other researchers
  • Reusability, facilitating further research

Installing, configuring, and testing unknown software of research quality is difficult. Please see our Recommendations on packaging and documenting your artifact in a way that is easier for the AEC to evaluate.

Nominations

For the seventh year, POPL is creating an Artifact Evaluation Committee (AEC) to promote repeatable experiments in the POPL community. Last year, 56% of the accepted papers submitted artifacts.

We are proud to announce an open call for nominations of senior graduate students, postdocs, or research community members to serve on the POPL 2021 AEC. We encourage self-nominations from anyone interested in participating in the artifact evaluation process. Participation in the AEC can provide useful insight into the value of artifacts and the process of artifact evaluation, and also helps to establish community norms for artifacts.

Please nominate yourself or others

We particularly encourage graduate students to serve on the AEC. Qualified graduate students are often in a much better position than many researchers to handle the diversity of systems we will encounter. In addition, these graduate students represent the future of the community, so involving them in this process early will help push this process forward.

The work of the AEC will be done between October 1 and November 2, so we are looking for people who are available during that time. You can find more information about the responsibilities of AEC members in the reviewer guidelines.

Please fill out the nomination form for each nominee by July 7, 2020. Please note that in the past we have had more nominees than slots on the AEC, and so not all nominees will be able to join the AEC.

We look forward to your nominations! Thank you for your help!

Artifact evaluation begins with authors of (conditionally) accepted POPL papers submitting artifacts on HotCRP (https://popl21ae.hotcrp.com). Artifact evaluation is optional. Authors are strongly encouraged to submit an artifact to the AEC, but not doing so will not impact their paper acceptance. Authors may, but are not required to, provide their artifacts to paper reviewers as supplemental materials.

Artifacts are submitted as a stable URL or, if that is not possible, as an uploaded archive. We recommend using a URL that you can update in response to reviewer comments, to fix issues that come up. Additionally, authors are asked to enter topics, conflicts, and “bidding instructions” to be used for assigning reviewers. You must check the “ready for review” box before the deadline for your artifact to be considered.

Artifact evaluation is single-blind. Artifacts should not collect any telemetry or logging data; if that is impossible to ensure, logging data should not be accessed by authors. Any data files included with the artifact should be anonymized.

Reviewers will be instructed that they may not publicize any part of an artifact during or after completing evaluation, nor retain any part of one after evaluation. Thus, authors are free to include models, data files, proprietary binaries, etc. in their artifacts.

AEC Membership

The AEC will consist of roughly 30 members, mostly senior graduate students, postdocs, and researchers. As the future of our community, graduate students will be the ones reproducing, reusing, and building upon the results published at POPL 2021. They are also better positioned to handle the diversity of systems that artifacts span.

This year, for the second time, we are hosting a public call for committee members. Please see the Call for Nominations for more details. Participation in the AEC demonstrates the value of artifacts, provides early experience with the peer review process, and establishes community norms. We therefore seek to include a broad cross-section of the POPL community on the AEC.

Two-Phase Evaluation

This year, the artifact evaluation process will proceed in two phases.

In the first, “kick the tires” phase, reviewers download and install the artifact (if relevant) and exercise its basic functionality to ensure that it works. We recommend authors include explicit instructions for this step. An artifact that fails the first phase, leaving reviewers unable to download and install it, cannot be accepted.

In the second, “evaluation” phase, reviewers systematically evaluate all claims in the paper via procedures included in the artifact, checking consistency, completeness, documentation, and reusability. We recommend authors list all claims in the paper and indicate how to evaluate each claim using the artifact.

Reviewers and authors will communicate back and forth during the review process over HotCRP. We have set up HotCRP to allow reviewers to ask questions or raise issues: those questions and issues will immediately be forwarded to authors, who will be able to answer questions or implement fixes.

After the two-phase evaluation process, the AEC will discuss each artifact and notify authors of the final decision.

Two days separate the AEC notification from the camera ready deadline for accepted papers. This gap allows authors time to update their papers to indicate artifact acceptance.

Badges

The AEC will award three ACM standard badges. Badges are added to papers by the publisher, not by the authors.

[Badge images: ACM “Artifacts Evaluated – Reusable”, “Artifacts Evaluated – Functional”, and “Artifacts Available”]

All artifacts that pass artifact evaluation will receive the “Artifacts Evaluated – Functional” badge.

Artifacts that receive above average scores and are made available in a way that enables reuse (including source code availability, open source licensing, and an open issue tracker), such as via GitHub, GitLab, or BitBucket, will be awarded the “Artifacts Evaluated – Reusable” badge.

Finally, artifacts that pass artifact evaluation and whose authors additionally make an immutable snapshot permanently available on a publicly accessible archival repository, such as Zenodo or the ACM DL, will also receive ACM’s “Artifacts Available” badge. An immutable snapshot does not prevent authors from also distributing their code in other ways, such as on an open source platform or on their personal websites.

Recommendations

These recommendations represent our view of how to package and document your artifact in a way that makes successful evaluation most likely. The guidelines are not mandatory, and diverging from them will not disqualify your artifact.

Packaging

We recommend creating a single web page at a stable URL from which reviewers can download the artifact and which also hosts instructions for installing and using the artifact. Having a stable URL, instead of uploading an archive, allows you to update the artifact in response to issues that come up.

Zenodo is a great way to create a stable URL for your artifact. Not only can you upload multiple versions in response to reviewer comments, but you can also use the same stable URL when publishing your paper, avoiding the need to upload your artifact twice.

We recommend packaging your artifact as a virtual machine image. Virtual machine images avoid issues with differing operating systems, versions, or dependencies. Other options for artifacts (such as source code, binary installers, web versions, or screencasts) are acceptable but generally cause more issues for reviewers and thus more issues for you. Virtual machines also protect reviewers from malfunctioning artifacts damaging their computer.

VirtualBox is free and widely available virtual machine host software; it is a good choice. Recent graphical Linux releases, such as Ubuntu 20.04 LTS, are good choices for the guest OS: the reviewer can easily navigate the image or install additional tools, and the resulting virtual machines are not too large to download.

The virtual machine should contain everything necessary for artifact evaluation: your source code, any compiler and build system your software needs, any platforms your artifact runs on (a proof assistant, or Cabal, or the Android emulator), any data or benchmarks, and all the tools you use to summarize the results of experiments, such as plotting software. Execute all the steps you expect the reviewer to do in the virtual machine: the virtual machine should already have dependencies pre-downloaded, the software pre-compiled, and output files pre-generated. This quick-starts the evaluation.

Insofar as possible, do not include anything in the artifact that would compromise reviewer anonymity, such as telemetry or analytics.

If reasonable, have your artifact output data like benchmark times or evaluation results to a file instead of to a console window.
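For instance, a benchmark driver along the lines of the following sketch records its timings in a CSV file rather than printing them to the console. The file name, benchmark functions, and CSV layout here are all hypothetical.

    # Hypothetical benchmark driver: writes timings to results.csv
    # instead of printing them to the console.
    import csv
    import time

    def run_benchmark(fn):
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    benchmarks = {
        "small_example": lambda: sum(range(10_000)),
        "larger_example": lambda: sorted(range(100_000), reverse=True),
    }

    with open("results.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["benchmark", "seconds"])
        for name, fn in benchmarks.items():
            writer.writerow([name, f"{run_benchmark(fn):.4f}"])

Reviewers can then inspect or regenerate the file without scrolling back through console output.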

Further advice for particular types of artifacts can be found online. Like the rest of these recommendations, such advice is general guidance and may not be applicable to every artifact.

Documentation

Besides the artifact itself, we recommend your documentation contain four sections:

  • A complete list of claims made by your paper
  • Download, installation, and sanity-testing instructions
  • Evaluation instructions
  • Additional artifact description

Artifact submissions are single-blind: reviewers will know the authors for each artifact, so there is no need to expend effort to anonymize the artifact. If you have any questions about how best to package your artifact, please don’t hesitate to contact the AEC chairs, at pavpan@cs.utah.edu and jeehoon.kang@kaist.ac.kr.

List of claims

The list of claims should cover all claims made in the paper. For each claim, provide a reference to the claim in the paper, the portion of the artifact that supports it, and the steps in the evaluation instructions that check it. The artifact need not support every claim in the paper; when evaluating the completeness of an artifact, reviewers will weigh the centrality and importance of the supported claims. Listing each claim individually provides the reviewer with a checklist to follow during the second, evaluation phase of the process. Organize the list of claims by section and subsection of the paper. A claim might read,

Theorem 12 from Section 5.2 of the paper corresponds to the theorem “foo” in the Coq file “src/Blah.v” and is evaluated in Step 7 of the evaluation instructions.

Some artifacts may attempt to perform malicious operations by design. Boldly and explicitly flag this in the instructions so AEC members can take appropriate precautions before installing and running these artifacts.

Reviewers expect artifacts to be buggy and immature, and to have obscure error messages. Explicitly listing all claims allows the author to delineate which bugs invalidate the paper’s results and which are simply a normal part of the software engineering process.

Download, installation, and sanity-testing

The download, installation, and sanity-testing instructions should contain complete instructions for obtaining a copy of the artifact and ensuring that it works. List any software the reviewer will need (such as virtual machine host software) along with version numbers and platforms that are known to work. Then list all files the reviewer will need to download (such as the virtual machine image) before beginning. Downloads take time, and reviewers prefer to complete all downloads before beginning evaluation.

Note the guest OS used in the virtual machine, and any unusual modifications made to it. Explain its directory layout. It’s a good idea to put your artifact on the desktop of a graphical guest OS or in the home directory of a terminal-only guest OS.

Installation and sanity-testing instructions should list all steps necessary to set up the artifact and ensure that it works. This includes explaining how to invoke the build system; how to run the artifact on small test cases, benchmarks, or proofs; and the expected output. Your instructions should make clear which directory to run each command from, what output files it generates, and how to compare those output files to the paper. If your artifact generates plots, the sanity testing instructions should check that the plotting software works and the plots can be viewed.

Helper scripts that automate building the artifact, running it, and viewing the results can help reviewers out. Test those scripts carefully—what do they do if run twice?
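A top-level helper might look roughly like the sketch below. The make invocation, directory names, and experiment command are placeholders for whatever your artifact actually uses; deleting and recreating the output directory keeps the script safe to run twice.

    # Hypothetical "run everything" helper for an artifact.
    # The build and run commands are placeholders; adapt them to your build system.
    import shutil
    import subprocess
    from pathlib import Path

    OUT = Path("output")

    def main():
        # Start from a clean output directory so the script behaves the
        # same way the second time it is run.
        if OUT.exists():
            shutil.rmtree(OUT)
        OUT.mkdir()

        # Build the artifact (placeholder command).
        subprocess.run(["make", "-C", "src"], check=True)

        # Run the experiments and capture their output to a file.
        with open(OUT / "results.txt", "w") as log:
            subprocess.run(["./src/run_experiments"], stdout=log, check=True)

        print("Done. Results written to", OUT / "results.txt")

    if __name__ == "__main__":
        main()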

Aim for the download, installation, and sanity-testing instructions to be completable in about a half hour. Remember that reviewers will not know what error messages mean or how to circumvent errors. The more foolproof the artifact, the easier evaluation will be for them and for you.

Evaluation instructions

The evaluation instructions should describe how to run the complete artifact, end to end, and then evaluate each claim in the paper that the artifact supports. This section often takes the form of a series of commands that generate evaluation data, and then a claim-by-claim list of how to check that the evaluation data is similar to the claims in the paper.

For each command, note the output files it writes to, so the reviewer knows where to find the results. If possible, generate data in the same format and organization as in the paper: for a table, include a script that generates a similar table, and for a plot, generate a similar plot.
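As an illustration (a sketch only; the input file, table number, and column layout are hypothetical), a small post-processing script can reshape raw results into a table organized like the one in the paper:

    # Hypothetical script that reshapes raw results (results.csv) into a
    # plain-text table laid out like Table 2 of the paper.
    import csv

    with open("results.csv") as f:
        rows = [(r["benchmark"], float(r["seconds"])) for r in csv.DictReader(f)]

    with open("table2.txt", "w") as out:
        out.write(f"{'Benchmark':<20} {'Time (s)':>10}\n")
        for name, seconds in sorted(rows):
            out.write(f"{name:<20} {seconds:>10.3f}\n")

Matching the paper's layout lets reviewers compare rows side by side rather than hunting through raw logs.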

Indicate how similar you expect the artifact results to be. Program speed usually differs in a virtual machine, and this may lead to, for example, more timeouts. Indicate how many you expect. You might write, for example:

The paper claims 970/1031 benchmarks pass (claim 5). Because the program runs slower in a virtual machine, more benchmarks time out, so as few as 930 may pass.

Reviewers must use their judgement to check if the suggested comparison is reasonable, but the author can provide expert guidance to set expectations.

Explicitly include commands that check soundness, such as counting admits in a Coq code base. Explain any checks that fail.
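As a rough sketch (the source directory is a placeholder, and a real development would likely also check for axioms, for example with Coq's Print Assumptions), such a soundness check might count admits across the Coq sources:

    # Rough sketch: count occurrences of `admit`/`Admitted` in a Coq development.
    # The `src` directory is a placeholder for wherever the .v files live.
    import re
    from pathlib import Path

    pattern = re.compile(r"\b(?:admit|Admitted)\b")

    total = 0
    for path in sorted(Path("src").rglob("*.v")):
        count = len(pattern.findall(path.read_text(encoding="utf-8")))
        if count:
            print(f"{path}: {count} admit(s)")
        total += count

    print(f"Total admits: {total}")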

Aim for the evaluation instructions to take no more than a few hours. Clearly note steps that take more than a few minutes to complete. If the artifact cannot be evaluated in a few hours (experiments that require days to run, for example) consider an alternative artifact format, like a screencast.

Additional artifact description

The additional description should explain how the artifact is organized, which scripts and source files correspond to which experiments and components in the paper, and how reviewers can try their own inputs to the artifact. For a mechanical proof, this section can point the reviewer to key definitions and theorems.

Expect reviewers to examine this section if something goes wrong (an unexpected error, for example) or if they are satisfied with the artifact and want to explore it further.

Reviewers expect that new inputs can trigger bugs, flag warnings, or behave oddly. However, describing the artifact’s organization lends credence to claims of reusability. Reviewers may also want to examine components of the artifact that interest them.

Remember that the AEC is attempting to determine whether the artifact meets the expectations set by the paper. (The instructions to the committee are available in the reviewer guidelines below.) Package your artifact to help the committee easily evaluate this.

Reviewer Guidelines

Artifacts are accessed, and reviews submitted, via HotCRP: https://popl21ae.hotcrp.com/

Deadlines

Do not wait to bid and review until the last minute. Artifact evaluation often hits snags that require author feedback. Waiting until the last minute robs authors of good-faith evaluations.

  • Artifact bidding starts: Tue, 6 October 2020 (AoE)
  • Artifact bidding deadline: Thu, 8 October 2020 (AoE)
  • Phase I reviews due: 22 October 2020 (AoE)
  • Phase II reviews due: 29 October 2020 (AoE)
  • Final decisions: 1 November 2020 (AoE)

Install the artifacts early to maximize the time available for troubleshooting during phase 1. At the end of phase 2, submit your reviews early, so you have more time to read other reviews and come to a consensus on the relative quality of your assigned artifacts.

Downloading Artifacts

Download the artifact early. Many artifacts will be virtual machine images, often gigabytes in size. You may want to download the artifacts overnight. Make sure you have the disk space for them.

As a reviewer, you do not have permission to make public the contents of any artifact. AEC members must not publicize any part of an artifact during or after completing evaluation, nor retain any part of it after evaluation. This rule allows extending artifact evaluation to proprietary software, limited-distribution data sets, and corporate researchers that need legal permission to participate in the artifact evaluation process.

While you’re waiting for artifacts to download, read the paper, especially any sections on evaluation, case studies, or mechanised proofs. You will need to compare the artifact with expectations set by the paper.

Phase 1 review

Attempt to download, install, and sanity-check the artifacts you are reviewing as early as possible. The authors will need time to troubleshoot any issues.

The HotCRP configuration for Artifact Evaluation allows for direct communication between reviewers and authors while preserving reviewer anonymity. You can directly ask authors for help if you get stuck or hit an unexpected error. Do not spend a long time trying to figure issues out on your own. The authors deserve to know about bugs in their artifact, so that they can improve it for others.

Make sure your questions are marked “author-visible comments”. You can also make comments visible only to other reviewers if you want to discuss the artifact.

Each review is available to authors as soon as it is submitted. Only submit a review if it is ready for the authors—that is, if it is substantially complete. The authors can respond to a review or answer a question using HotCRP comments.

HotCRP comments are not a place to argue with authors. It is the responsibility of all reviewers of an artifact to use their best judgement to evaluate it and then to come to a consensus on its quality. AEC chairs have final authority over artifact outcomes.

Artifact evaluation guidelines

Once you have downloaded, installed, and sanity-tested the artifact successfully, move on to evaluating the artifact against the claims in the paper.

Your goal is to test whether the artifact meets expectations set by the paper. Do not review the paper. Do not review the appropriateness of the paper evaluation. Merely review whether the artifact embodies the claims the paper makes of it.

This is not a completely objective process and that’s okay. We want to know if the artifact meets your expectations as a researcher. Does something in the artifact annoy you or delight you? You should say so in your review. Structure your evaluation against the four criteria of consistency, completeness, documentation, and reusability.

Complete replicability is ideal, but artifacts only need to support the paper’s claims. Virtual machines differ from the author’s environment, and many tools are stochastic. Deviations, especially on timings, are expected. Do not demand exactness.

The ideal artifact evaluates all claims in the paper, but this is also not required for acceptance. You must judge whether the artifact sufficiently supports all of the important claims made in the paper. The paper may claim software A and B, but the artifact might contain only A. This may be appropriate if A is the focus of the paper, or if the paper makes clear that only A is part of the artifact. Feel free to say, “I wish you’d also provided B as an artifact”, but know that it won’t affect this paper.

The documentation will never be as good as it could be—do not evaluate documentation on those grounds. Instead, judge whether the documentation allowed you to complete the tasks envisioned in the paper.

We encourage you to try your own tests, but don't be a true adversary. This is research software, so things will break, and that breakage does not impinge on reusability. Assume the authors acted in good faith and aren't trying to hoodwink us. Instead, suppose you had to use or modify the artifact for your own research. Do you think you could? You'll have to imagine and extrapolate, but that's okay.

You may find it easier to adjust scores once you’ve reviewed all 2-3 of your artifacts. Moreover, once you read others’ reviews, you’ll get a better sense of average artifact quality. Don’t hesitate to change your scores later, being aware that authors will immediately see these updates too.

Selecting the best artifacts

This year we will again be awarding both the default “Functional” badge and the cooler “Reusable” badge. The cooler badge is for artifacts that receive the best scores and that are additionally made available in a way that enables reuse: for software artifacts, this means that the authors commit to releasing the artifact freely, under an open source license, and to making it clear how to submit issues. Your role is not only to check that the artifacts we accept meet the expectations set by the paper, but also to identify the best reusable artifacts.

We believe the dissemination of artifacts benefits our science and engineering as a whole. Their availability improves replicability and reproducibility, and enables authors to build on top of each others’ work. It can also help more unambiguously resolve questions about cases not considered by the original authors.

The goal of the artifact evaluation process is two-fold: to both reward and probe. Our primary goal is to reward authors who take the trouble to create useful artifacts beyond the paper. Sometimes the software tools that accompany a paper take years to build; in many such cases, authors who go to this trouble should be rewarded for setting high standards and creating systems that others in the community can build on. Conversely, authors sometimes take liberties in describing the status of their artifacts, making claims they would temper if they knew the artifacts were going to be scrutinized. Artifact evaluation thus leads to more accurate reporting.

Our hope is that eventually, the assessment of a paper’s accompanying artifacts will guide the decision-making about papers: that is, the Artifact Evaluation Committee (AEC) would inform and advise the Program Committee (PC). This would, however, represent a radical shift in our conference evaluation processes; we would rather proceed gradually. Thus, artifact evaluation is optional, and authors choose to undergo evaluation only after their paper has been conditionally accepted. Nonetheless, feedback from the Artifact Evaluation Committee can help improve both the final version of the paper and any publicly released artifacts. The authors are free to take or ignore the AEC feedback at their discretion.

Beyond helping the community as a whole, artifact evaluation confers several direct and indirect benefits on the authors themselves. The most direct benefit is, of course, the recognition that the authors accrue. But the very act of creating a bundle that can be evaluated by the AEC brings several further benefits:

  • The same bundle can be publicly released or distributed to third-parties.
  • A bundle can be used subsequently for later experiments (e.g., on new parameters).
  • The bundle makes it easier to re-run the system later, for example when responding to a journal reviewer’s questions.
  • The bundle is more likely to survive being put in storage between the departure of one student and the arrival of the next.

Creating a bundle that has all these properties can be onerous; therefore, the process described above does not require an artifact to have all of them. It offers a route to evaluation that confers fewer benefits for vastly less effort.

Questions? Use the POPL Artifact Evaluation contact form.