POPL 2023 - Artifact Evaluation

Paper artifacts are the software, mechanised proofs, test suites, and benchmarks that support the paper and evaluate its claims. To evaluate paper artifacts, POPL has since 2015 run an artifact evaluation process, in line with similar efforts in our community.

Submit your Artifact

Artifact evaluation is an optional process that we highly encourage. We solicit artifacts from authors of all accepted papers. Artifacts can be software, mechanical proofs, test suites, benchmarks, or anything else that bolsters the claims of the paper, except paper proofs, which the AEC lacks the time and expertise to carefully review.

Artifact registration deadline: 1 October 2022 (AoE)
Artifact submission deadline: 5 October 2022 (AoE)
Artifact evaluation phase 1: 14 October 2022 (AoE)
Artifact acceptance decisions: 3 November 2022 (AoE)

Acceptance Criteria

The AEC targets a 100% acceptance rate. Artifacts are evaluated against the criteria of:

Consistency with the claims of the paper and the results it presents
Completeness insofar as possible, supporting all evaluated claims of the paper
Documentation to allow easy reproduction by other researchers
Reusability, facilitating further research

Installing, configuring, and testing unknown software of research quality is difficult. Please see our Recommendations on packaging and documenting your artifact in a way that is easier for the AEC to evaluate.

Artifact evaluation begins with authors of (conditionally) accepted POPL papers submitting artifacts on HotCRP (https://popl23ae.hotcrp.com). Artifact evaluation is optional. Authors are strongly encouraged to submit an artifact to the AEC, but not doing so will not impact their paper acceptance. Authors may, but are not required to, provide their artifacts to paper reviewers as supplemental materials.

Artifacts are submitted as a stable URL or, if that is not possible, as an uploaded archive. We recommend using a URL that you can update in response to reviewer comments, to fix issues that come up. Additionally, authors are asked to enter topics, conflicts, and “bidding instructions” to be used for assigning reviewers. You must check the “ready for review” box before the deadline for your artifact to be considered.

Artifact evaluation is single blind. Artifacts should not collect any telemetry or logging data; if that is impossible to ensure, logging data should not be accessed by authors. Any data files included with the artifact should be anonymized.

Reviewers will be instructed that they may not publicize any part of an artifact during or after completing evaluation, nor retain any part of one after evaluation. Thus, authors are free to include models, data files, proprietary binaries, etc. in their artifact.

AEC Membership

The AEC will consist of roughly 30 members, mostly senior graduate students, postdocs, and researchers. As the future of our community, graduate students will be the ones reproducing, reusing, and building upon the results published at POPL 2023. They are also better positioned to handle the diversity of systems that artifacts span.

Participation in the AEC demonstrates the value of artifacts, provides early experience with the peer review process, and establishes community norms. We therefore seek to include a broad cross-section of the POPL community on the AEC.

Two-Phase Evaluation

This year, the artifact evaluation process will proceed in two phases.

In the first “kick the tires” phase reviewers download and install the artifact (if relevant) and exercise the basic functionality of the artifact to ensure that it works. We recommend authors include explicit instructions for this step. Failing the first phase—so that reviewers are unable to download and install the artifact—will prevent the artifact from being accepted.

In the second “evaluation” phase reviewers systematically evaluate all claims in the paper via procedures included in the artifact to ensure consistency, completeness, documentation, and reusability. We recommend authors list all claims in the paper and indicate how to evaluate each claim using the artifact.

Reviewers and authors will communicate back and forth during the review process over HotCRP. We have set up HotCRP to allow reviewers to ask questions or raise issues: those questions and issues will immediately be forwarded to authors, who will be able to answer questions or implement fixes.

After the two-phase evaluation process, the AEC will discuss each artifact and notify authors of the final decision.

Two days separate the AEC notification from the camera ready deadline for accepted papers. This gap allows authors time to update their papers to indicate artifact acceptance.

Badges

The AEC will award three ACM standard badges. Badges are added to papers by the publisher, not by the authors.

ACM’s Artifacts Evaluated – Reusable Badge
ACM’s Artifacts Evaluated – Functional Badge
ACM’s Artifacts Available Badge

Per ACM policy (https://www.acm.org/publications/policies/artifact-review-and-badging-current), papers will receive at most one of the Reusable and Functional badges.

In particular, artifacts that receive above average scores and are made available in a way that enables reuse (including source code availability, open source licensing, and an open issue tracker), such as via GitHub, GitLab, or BitBucket, will be awarded the “Artifacts Evaluated - Reusable” badge.

Artifacts which pass review but do not qualify for the Reusable badge will be awarded the “Artifacts Evaluated - Functional” badge instead.

Finally, artifacts which the authors make permanently available on a publicly accessible archival repository such as Zenodo or ACM DL will also receive ACM’s “Artifacts Available” badge. An immutable snapshot does not prevent authors from also distributing their code in another way, on an open source platform or on their personal websites.

These recommendations represent our view of how to package and document your artifact in a way that makes successful evaluation most likely. The guidelines are not mandatory, and diverging from them will not disqualify your artifact.

Packaging

We recommend creating a single web page at a stable URL from which reviewers can download the artifact and which also hosts instructions for installing and using the artifact. Having a stable URL, instead of uploading an archive, allows you to update the artifact in response to issues that come up.

Zenodo is a great way to create a stable URL for your artifact. Not only can you upload multiple versions in response to reviewer comments, you can use the same stable URL when publishing your paper to avoid uploading your artifact twice.

We recommend packaging your artifact as a virtual machine image. Virtual machine images avoid issues with differing operating systems, versions, or dependencies. Other options for artifacts (such as source code, binary installers, web versions, or screencasts) are acceptable but generally cause more issues for reviewers and thus more issues for you. Virtual machines also protect reviewers from malfunctioning artifacts damaging their computer.

VirtualBox is free and widely available virtual machine host software; it is a good choice. Recent graphical Linux releases, such as Ubuntu 20.04 LTS, are good choices for the guest OS: the reviewer can easily navigate the image or install additional tools, and the resulting virtual machines are not too large to download.

The virtual machine should contain everything necessary for artifact evaluation: your source code, any compiler and build system your software needs, any platforms your artifact runs on (a proof assistant, or Cabal, or the Android emulator), any data or benchmarks, and all the tools you use to summarize the results of experiments such as plotting software. Execute all the steps you expect the reviewer to do in the virtual machine: the virtual machine should already have dependencies pre-downloaded, the software pre-compiled, and output files pre-generated. This quick-starts the evaluation.

Insofar as possible, do not include anything in the artifact that would compromise reviewer anonymity, such as telemetry or analytics.

If reasonable, have your artifact output data like benchmark times or evaluation results to a file instead of to a console window.
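
As a minimal sketch of this advice, the Python function below times a set of benchmarks and writes one row per benchmark to a CSV file rather than printing to the console. The function name `run_benchmarks`, the column names, and the workloads passed to it are illustrative assumptions, not part of any POPL requirement.

```python
import csv
import time
from pathlib import Path


def run_benchmarks(benchmarks, out_path):
    """Time each benchmark and write one CSV row per benchmark to out_path.

    `benchmarks` maps a benchmark name to a zero-argument callable
    (a hypothetical stand-in for a call into your artifact).
    """
    out_path = Path(out_path)
    # Create the output directory so the script works on a fresh checkout.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["benchmark", "seconds"])
        for name, func in benchmarks.items():
            start = time.perf_counter()
            func()
            elapsed = time.perf_counter() - start
            writer.writerow([name, f"{elapsed:.6f}"])
    return out_path
```

For example, `run_benchmarks({"sort": lambda: sorted(range(10_000, 0, -1))}, "results/timings.csv")` would leave the timings in `results/timings.csv`, where a reviewer can inspect them after the run has finished.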

Further advice for particular types of artifacts can be found online:

SIGPLAN Empirical Evaluation Guidelines
Proof Artifact Guidelines
Checking Machine-Checked Proofs

Like the rest of these recommendations, these links are general guidelines and may not be possible for your artifact.

Documentation

Besides the artifact itself, we recommend your documentation contain four sections:

A complete list of claims made by your paper
Download, installation, and sanity-testing instructions
Evaluation instructions
Additional artifact description

Artifact submissions are single-blind: reviewers will know the authors for each artifact, so there is no need to expend effort to anonymize the artifact. If you have any questions about how best to package your artifact, please don’t hesitate to contact the AEC chairs, at leonidas@umd.edu and raghotha@usc.edu.

List of claims

The list of claims should list all claims made in the paper. For each claim, provide a reference to the claim in the paper, the portion of the artifact evaluating that claim, and the evaluation instructions for evaluating that claim. The artifact need not support every claim in the paper; when evaluating the completeness of an artifact, reviewers will weigh the centrality and importance of the supported claims. Listing each claim individually provides the reviewer with a checklist to follow during the second, evaluation phase of the process. Organize the list of claims by section and subsection of the paper. A claim might read,

Theorem 12 from Section 5.2 of the paper corresponds to the theorem “foo” in the Coq file “src/Blah.v” and is evaluated in Step 7 of the evaluation instructions.

Some artifacts may attempt to perform malicious operations by design. Boldly and explicitly flag this in the instructions so AEC members can take appropriate precautions before installing and running these artifacts.

Reviewers expect artifacts to be buggy and immature, and to have obscure error messages. Explicitly listing all claims allows the author to delineate which bugs invalidate the paper’s results and which are simply a normal part of the software engineering process.

Download, installation, and sanity-testing

The download, installation, and sanity-testing instructions should contain complete instructions for obtaining a copy of the artifact and ensuring that it works. List any software the reviewer will need (such as virtual machine host software) along with version numbers and platforms that are known to work. Then list all files the reviewer will need to download (such as the virtual machine image) before beginning. Downloads take time, and reviewers prefer to complete all downloads before beginning evaluation.

Note the guest OS used in the virtual machine, and any unusual modifications made to it. Explain its directory layout. It’s a good idea to put your artifact on the desktop of a graphical guest OS or in the home directory of a terminal-only guest OS.

Installation and sanity-testing instructions should list all steps necessary to set up the artifact and ensure that it works. This includes explaining how to invoke the build system; how to run the artifact on small test cases, benchmarks, or proofs; and the expected output. Your instructions should make clear which directory to run each command from, what output files it generates, and how to compare those output files to the paper. If your artifact generates plots, the sanity testing instructions should check that the plotting software works and the plots can be viewed.

Helper scripts that automate building the artifact, running it, and viewing the results can help reviewers out. Test those scripts carefully—what do they do if run twice?
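
One way to make a helper script safe to run twice is to have it start from a clean results directory and stop at the first failing step. The sketch below assumes a hypothetical list of (name, command) steps; the function names and the log file layout are illustrative only.

```python
import shutil
import subprocess
from pathlib import Path


def fresh_dir(path):
    """Return an empty directory at path, deleting leftovers from an earlier run."""
    path = Path(path)
    if path.exists():
        # Warning: this removes everything under path.
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return path


def run_step(command, log_path):
    """Run one command, save stdout and stderr to log_path, and stop on failure."""
    with open(log_path, "w") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    if result.returncode != 0:
        raise RuntimeError(f"{command!r} failed; see {log_path}")


def run_all(results_dir, steps):
    """Run every (name, command) step into a clean results directory."""
    out = fresh_dir(results_dir)
    for name, command in steps:
        run_step(command, out / f"{name}.log")
    return out
```

Starting from an empty directory means a second run cannot mix stale and fresh output. If earlier results are expensive to regenerate, a script could instead skip steps whose output already exists.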

Aim for the download, installation, and sanity-testing instructions to be completable in about a half hour. Remember that reviewers will not know what error messages mean or how to circumvent errors. The more foolproof the artifact, the easier evaluation will be for them and for you.

Evaluation instructions

The evaluation instructions should describe how to run the complete artifact, end to end, and then evaluate each claim in the paper that the artifact supports. This section often takes the form of a series of commands that generate evaluation data, and then a claim-by-claim list of how to check that the evaluation data is similar to the claims in the paper.

For each command, note the output files it writes to, so the reviewer knows where to find the results. If possible, generate data in the same format and organization as in the paper: for a table, include a script that generates a similar table, and for a plot, generate a similar plot.
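
As an illustration of generating a table similar to the one in the paper, the sketch below renders a CSV results file as an aligned plain-text table. It assumes the results file has a header row; the function names and the column layout are hypothetical and would need adapting to the table in your paper.

```python
import csv


def format_table(headers, rows):
    """Render rows as a plain-text table with left-aligned, padded columns."""
    cells = [list(map(str, headers))] + [list(map(str, r)) for r in rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]
    lines = ["  ".join(c.ljust(w) for c, w in zip(row, widths)).rstrip()
             for row in cells]
    # Separator line between the header and the data rows.
    lines.insert(1, "  ".join("-" * w for w in widths))
    return "\n".join(lines)


def table_from_csv(csv_path):
    """Read a CSV file with a header row and render it as a text table."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)
        return format_table(headers, list(reader))
```

Keeping the column order and row order the same as in the paper lets a reviewer compare the two tables line by line.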

Indicate how similar you expect the artifact results to be. Program speed usually differs in a virtual machine, and this may lead to, for example, more timeouts. Indicate how many you expect. You might write, for example:

The paper claims 970/1031 benchmarks pass (claim 5). Because the program runs slower in a virtual machine, more benchmarks time out, so as few as 930 may pass.

Reviewers must use their judgement to check if the suggested comparison is reasonable, but the author can provide expert guidance to set expectations.
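
A stated tolerance like the one in the example above can also be encoded in a small check, so reviewers see at a glance whether their run falls in the expected range. The sketch below reuses the figures from that example (970 claimed, 930 as the floor); the function names and the "pass" status string are assumptions made for illustration.

```python
def count_passes(results):
    """Count benchmarks whose recorded status is "pass".

    `results` maps a benchmark name to a status string (hypothetical format).
    """
    return sum(1 for status in results.values() if status == "pass")


def check_claim(observed, claimed, minimum):
    """Classify an observed count against the paper's figure and an accepted floor."""
    if observed >= claimed:
        return "matches the paper"
    if observed >= minimum:
        return "within the expected deviation"
    return "below the expected range"


if __name__ == "__main__":
    # Figures from the example claim: 970 passes claimed, as few as 930 expected.
    print(check_claim(observed=945, claimed=970, minimum=930))
```

A check like this supports the reviewer's judgement and does not replace it: the author still needs to explain why the chosen floor is reasonable.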

Explicitly include commands that check soundness, such as counting admits in a Coq code base. Explain any checks that fail.
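
A soundness check of this kind could be sketched as follows. This is a naive textual scan, not a Coq-aware analysis: it counts the tokens `admit` and `Admitted` in `.v` files and will also count occurrences inside comments. The function name and the choice of tokens are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches the tactic `admit` and the proof terminator `Admitted` as whole words.
ADMIT_PATTERN = re.compile(r"\b(Admitted|admit)\b")


def count_admits(root):
    """Map each .v file under root to its number of admit/Admitted tokens.

    Files with no matches are omitted, so an empty result means none were found.
    """
    counts = {}
    for path in sorted(Path(root).rglob("*.v")):
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = len(ADMIT_PATTERN.findall(text))
        if hits:
            counts[str(path.relative_to(root))] = hits
    return counts
```

A textual count is only a first pass. Coq's `Print Assumptions` command reports the axioms and admitted lemmas that a given theorem actually depends on, which is usually the more precise check.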

Aim for the evaluation instructions to take no more than a few hours. Clearly note steps that take more than a few minutes to complete. If the artifact cannot be evaluated in a few hours (experiments that require days to run, for example) consider an alternative artifact format, like a screencast.

Additional artifact description

The additional description should explain how the artifact is organized, which scripts and source files correspond to which experiments and components in the paper, and how reviewers can try their own inputs to the artifact. For a mechanical proof, this section can point the reviewer to key definitions and theorems.

Expect reviewers to examine this section if something goes wrong (an unexpected error, for example) or if they are satisfied with the artifact and want to explore it further.

Reviewers expect that new inputs can trigger bugs, flag warnings, or behave oddly. However, describing the artifact’s organization lends credence to claims of reusability. Reviewers may also want to examine components of the artifact that interest them.

Remember that the AEC is attempting to determine whether the artifact meets the expectations set by the paper. (The instructions to the committee are available here.) Package your artifact to help the committee easily evaluate this.

Introduction and Scope

Thank you for volunteering to serve on the POPL 2023 Artifact Evaluation Committee. Artifacts are an important product of the scientific process, and it is your goal, as members of the AEC, to study these artifacts and determine whether they meet the expectations laid out in the paper and adequately support its central claims.

We emphasize that artifact evaluation is neither adversarial nor an examination of the scientific merits of the paper. Our question is narrow: Does the artifact meet the expectations laid out in the paper? As far as possible, in cooperation with the authors, and without sacrificing scientific rigor, we aim to accept all artifacts which satisfy this requirement.

Badges and Acceptance Criteria

We will be awarding two badges, Functional and Reusable, based on the following considerations:

Functional artifacts: Does the artifact completely support all evaluated claims in the paper, and are the results from running the artifact consistent with the claims made in the paper?
Reusable artifacts: Are you able to modify the artifact to solve problems and benchmarks different from those in the paper? Is the artifact sufficiently well-documented to support reuse and future research?

Note that the Reusable badge is strictly superior to the Functional badge. Accordingly, the documentation requirements for the Reusable badge are higher than those for the Functional badge.

Overview of Process

In contrast to previous years, we will auto-register all (conditionally) accepted papers for artifact evaluation. Immediately after the announcement of conditionally accepted papers, you will have an opportunity to examine the list of papers, and place bids for the artifacts that most closely match your research background, interests, and experience. Based on your bids, and depending on which paper authors actually choose to submit artifacts, we will assign you two or three artifacts to review.

We will organize the evaluation process as the following sequence of three milestones:

Milestone 1: Kick the Tires (By Oct 14)

Research software is delicate and needs careful setup. In order to ease this process, in the first phase of artifact evaluation, you will be expected to, at the very least, install the artifact and run a minimum set of commands to determine whether it runs as explained in the documentation. During this process, we expect you to:

Read the paper, and answer the following questions:

Q1: What is the central contribution of the paper?
Q2: What claims do the authors make of the artifact, and how do they connect to Q1 above?
Q3: Can you list the specific, significant experimental claims made in the paper (such as figures, tables, etc.)?
Q4: What do you expect as a reasonable range of deviations for the experimental results?

Install the artifact, and answer the following questions:

Q5: Are you able to install and test the artifact as indicated by the authors in their “kick the tires” instructions?
Q6: Are there any significant modifications you needed to make to the artifact while answering Q5?
Q7: For each claim highlighted in Q3 above, do you know how to reproduce the result, using the artifact?
Q8: Is there anything else that the authors or other reviewers should be aware of?

Following this, you will leave a comment on HotCRP indicating success, or ask the authors questions. These questions can concern unclear commands, or error messages that you encounter. The authors will have a chance to respond, fix bugs in their artifact or distribution, or make additional clarifications. None of this will be counted against the artifact. Remember, the evaluation process is cooperative!

Milestone 2: Evaluating Functionality (By Oct 21)

After the kick-the-tires phase, you will perform an actual review of the artifact. During this phase, you will answer the following questions:

Q9: Does the artifact provide evidence for all the claims you noted in Q3? This corresponds to the completeness criterion of your evaluation.
Q10: Do the results of running / examining the artifact meet your expectations after having read the paper? This corresponds to the criterion of consistency between the paper and the artifact.
Q11: Is the artifact well-documented, to the extent that answering questions Q5–Q10 is straightforward?

Note that in some research areas, questions Q1–Q11 may be inappropriate or irrelevant. In these cases, we encourage you to disregard our suggestions and review the artifact as you think is most appropriate.

Milestone 3: Evaluating Reusability (By Oct 28)

Finally, you will evaluate artifacts for reusability in new settings. You will principally focus on the following question:

Q12: Are you able to modify the benchmarks / artifact to run simple additional experiments, similar to, but beyond those discussed in the paper?

Writing Reviews (By Oct 28) and Discussion (Oct 28–Oct 31)

You will write a review by drawing on your answers to Q1–Q12. Make sure to include a specific recommendation of whether to award a badge, and of which badge(s) to award. Once you submit your draft review, we will make all reviews public, so you have a chance to discuss the artifact with other reviewers, and reach a consensus. You should feel free to change your mind or revise your review during these discussions.

We believe the dissemination of artifacts benefits our science and engineering as a whole. Their availability improves replicability and reproducibility, and enables authors to build on top of each other’s work. It can also help more unambiguously resolve questions about cases not considered by the original authors.

The goal of the artifact evaluation process is two-fold: to both reward and probe. Our primary goal is to reward authors who take the trouble to create useful artifacts beyond the paper. Sometimes the software tools that accompany the paper take years to build; in many such cases, authors who go to this trouble should be rewarded for setting high standards and creating systems that others in the community can build on. Conversely, authors sometimes take liberties in describing the status of their artifacts, making claims they would temper if they knew the artifacts would be scrutinized. Scrutiny thus leads to more accurate reporting.

Our hope is that eventually, the assessment of a paper’s accompanying artifacts will guide the decision-making about papers: that is, the Artifact Evaluation Committee (AEC) would inform and advise the Program Committee (PC). This would, however, represent a radical shift in our conference evaluation processes; we would rather proceed gradually. Thus, artifact evaluation is optional, and authors choose to undergo evaluation only after their paper has been conditionally accepted. Nonetheless, feedback from the Artifact Evaluation Committee can help improve both the final version of the paper and any publicly released artifacts. The authors are free to take or ignore the AEC feedback at their discretion.

Beyond helping the community as a whole, artifact evaluation confers several direct and indirect benefits to the authors themselves. The most direct benefit is, of course, the recognition that the authors accrue. But the very act of creating a bundle that can be evaluated by the AEC confers several benefits:

The same bundle can be publicly released or distributed to third-parties.
A bundle can be used subsequently for later experiments (e.g., on new parameters).
The bundle simplifies having to re-run the system subsequently when, say, having to respond to a journal reviewer’s questions.
The bundle is more likely to survive being put in storage between the departure of one student and the arrival of the next.

Creating a bundle that meets all these properties can be onerous; therefore, the process we describe below does not require an artifact to have all these properties. It offers a route to evaluation that confers fewer benefits for vastly less effort.

Leonidas Lampropoulos, Artifact Evaluation Co-Chair

University of Maryland, College Park

Greece

Mukund Raghothaman, Artifact Evaluation Co-Chair

University of Southern California

Cezar-Constantin Andrici

MPI-SP

Germany

Jaime Arias

CNRS; LIPN; Université Sorbonne Paris Nord

France

Aleš Bizjak

Aarhus University

Denmark

Marco Campion

University of Verona

Italy

Stefanos Chaliasos

Imperial College London

United Kingdom

Germán Andrés Delbianco

Nomadic Labs

France

Robert Dickerson

Purdue University

United States

Matthías Páll Gissurarson

Chalmers University of Technology, Sweden

Sweden

Andrew Habib

SnT, University of Luxembourg

Luxembourg

Hans-Dieter Hiep

Leiden Institute of Advanced Computer Science (LIACS) & Centrum Wiskunde & Informatica (CWI)

Netherlands

Jason Z.S. Hu

McGill University

Canada

Junyoung Jang

McGill University

Canada

Pankaj Kumar Kalita

IIT Kanpur

India

Ryan Kavanagh

McGill University

Canada

Ameya Ketkar

Uber

United States

Ravi Mangal

Carnegie Mellon University

United States

Daniel Marshall

University of Kent, UK

United Kingdom

Orestis Melkonian

University of Edinburgh

United Kingdom

Cameron Moy

Northeastern University

United States

Sujit Kumar Muduli

IIT Kanpur

India

Ramana Nagasamudram

Stevens Institute of Technology

United States

Mário Pereira

NOVA LINCS & DI -- Nova School of Science and Technology

Deena Postol

University of Maryland

Baber Rehman

University of Hong Kong

Qingkai Shi