OOPSLA Artifacts, SPLASH 2019
A Case Study on Artifact Evaluation for OOPSLA 2019
Thank you to the authors for submitting an excellent crop of artifacts to accompany their OOPSLA’19 papers, and to the 28 young researchers who served on the evaluation committee.
Our task was to assess the overall quality of artifacts. Three outcomes were possible from reviewing:
- Functional: The artifact was judged to reproduce the paper’s claims. Generally this meant all of the paper’s results; special accommodations were made in some cases when there were valid reasons for non-reproducibility.
- Reusable: A subset of Functional artifacts were further judged to be Reusable. These artifacts were particularly well-packaged, well-designed, and/or well-documented, in a way that reviewers felt gave a particularly solid starting point for future researchers to build on.
- Rejected: Artifacts that were not Functional were rejected. This should not be interpreted as casting doubt on the results of the accompanying paper, but merely as our inability to reproduce all sought-after results.
The artifact evaluation deadline this year was roughly a week after the notifications of Phase 1 outcomes for OOPSLA Research Papers. Authors were asked to submit artifacts in one of 2 formats:
- A compressed archive, with md5 hash at submission, uploaded to a hosting site that does not permit the authors to see information about who accesses the artifact.
- A pointer to a public Github/Bitbucket/Gitlab/etc. repository, with the hash of the relevant commit.
These formats permitted a range of submission types, including source code, virtual machine images, or container images. Docker images built from a source checkout were a popular addition this year. One set of authors submitted an artifact via DockerHub (accompanied by the sha256 hash Docker uses to compare image versions) after consultation with the chairs.
Reviewing was split into two rounds. The “Kick the Tires” round had reviewers follow author-provided instructions to validate that the artifacts could compile, run, and execute, without investigating their results. The goal was to quickly identify any “silly” setup issues. Authors were given a 4-day window to exchange comments with reviewers to work through any issues and even submit small corrections to artifacts or instructions. After that, the reviewers continued with the remaining author instructions to attempt to reproduce results, as well as evaluating whether all results that should be reproducible were included. During the Kick the Tires round, one set of authors who had previously contacted the AEC Chairs about special hardware requirements was permitted to switch from their original submission to an author-provided Amazon EC2 instance with the specialized hardware, to which reviewers connected.
During the second phase, reviewers focused on evaluating:
- Whether the submitted artifact could reproduce the relevant claims from the paper, as identified by authors.
- Whether the supported claims were actually adequate to say the paper’s results were reproduced. Any omissions needed a good reason, typically technical or legal encumbrances.
- For Reusability, whether the artifact formed a strong, as opposed to minimal, starting point for other groups to build upon the work. Standards applied were adjusted based on the sort of artifact (e.g., the fact that many people claim no Coq proofs are reusable was not held against Coq proofs).
We received 44 submissions, of which 9 were rejected, 17 were found to be Functional, and 18 were judged both Functional and Reusable. The rejected artifacts typically ran afoul of packaging issues that prevented artifacts from working (even after communication during the Kick-the-Tires round), or omitted the main benchmarks from the paper (without explanation), or omitted the ability to reproduce comparative results against baselines (e.g., tool X is Y% more accurate than tool Z on these benchmarks) when they were not simply reusing baseline numbers from earlier papers. None of the rejections cast doubt on the results of the accompanying papers.
Authors of artifacts found Functional or Reusable were given the option to apply for an Available badge, for making their artifact available publicly in a reliable (roughly, archival) location. This year our approach for this was simple: we asked authors interested in the availability badge to upload a version of their artifact to Zenodo, a service run in part by CERN for the archiving of scientific data sets and software. There is no cost to the authors, and every artifact is given its own unique DOI for archival purposes. Authors were instructed to upload exactly the version of the artifact that was reviewed by the AEC (plus a README and LICENSE). Similar to arXiv, Zenodo supports versioning, so while the version used for the Availability badge was exactly that reviewed, authors are free to upload improved versions, and viewing the page for the reviewed version will include indications of an update. Some authors intend to use this to improve packaging, directions, etc., following reviewer suggestions.
This year, out of the 35 accepted (Functional or Reusable) artifacts, 33 archived the reviewed version on Zenodo and consequently were awarded the Availability badge.
4 Distinguished artifacts were chosen by the AEC Chairs, based on nominations from the AEC. Artifacts with a chair as co-author were not eligible.
The chairs reread the reviews of nominated artifacts, and based on those contents selected the distinguished artifacts:
- A Path to DOT: Formalizing Fully Path-Dependent Types
- Safer Smart Contract Programming with Scilla
- Leveraging Rust Types for Modular Specification and Verification
- Generating a Fluent API with Syntax Checking from an LR Grammar
The chairs have also chosen 5 Distinguished Reviewers, based on the high quality of their reviews and participation in online discussion for artifacts.
- Jyothi Vedurada
- Fabian Muehlboeck
- Simon Fowler
- Anthony Canino
- Gabriel Radanne
Suggestions for Future Years
(1) A common problem was difficulty running artifacts on sufficiently powerful hardware. Reviewers used their primary machines, which had limited resources. We recommend advertising the option for authors to pay for cloud resources, while formalizing mechanisms to preserve reviewer anonymity. This would address difficulties with insufficient RAM, disk space, lack of GPU hardware, and other limitations encountered by this year’s reviewers.
(2) While this is a large ask for authors, anecdotally some of the “easiest” artifacts to get running were those where the authors had incorporated Docker into their development workflow; they simply submitted their source repository, and reviewers got exactly the environment used by the authors for experiments. Short of this, we recommend authors actually follow their own instructions to reviewers before submitting artifacts: common problems like missing files or wrong permissions are discovered almost immediately by reviewers, but dealing with them limits the Kick-the-Tires feedback to these relatively minor issues. Several artifacts were rejected because the authors’ instructions for completely reproducing results simply didn’t work: e.g., a subset of experiments ran, but other experiments crashed or failed with errors in the same way for all reviewers.
(3) One recurring problem was author confusion about the bar for artifacts being complete as specified in the call for artifacts. Some authors, for example, intentionally omitted most of the benchmarks from the actual paper, under the rationale that the experiments might take days or weeks to execute, but the result was that the AEC was not able to affirmatively reproduce any of the results. Another variant of this was the omission of baselines when a paper makes comparative claims against another (open source) tool. In this case, the tool used in a paper’s evaluation was often not the exact version used in prior work, and often not run on the exact benchmark sets of previous work. In such situations, the baseline results obtained for the paper were new measurements made for the paper, and inability to reproduce those baselines was interpreted as grounds for lack of a Functional designation.
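The missing-file and wrong-permission problems mentioned in suggestion (2) can often be caught by an automated check before submission. The following is a minimal sketch only, assuming a POSIX system; the function name `preflight` and the file names in the usage note are hypothetical and not part of any AEC tooling.

```python
import os
from pathlib import Path
from typing import Iterable, List


def preflight(root: Path, required: Iterable[str], executables: Iterable[str]) -> List[str]:
    """Return a list of problems found under `root`; an empty list means all checks passed.

    `required` lists paths (relative to `root`) that must exist.
    `executables` lists paths that must exist and carry an execute permission bit.
    """
    problems: List[str] = []
    for rel in required:
        if not (root / rel).exists():
            problems.append(f"missing: {rel}")
    for rel in executables:
        path = root / rel
        if not path.exists():
            problems.append(f"missing: {rel}")
        elif not os.access(path, os.X_OK):
            problems.append(f"not executable: {rel}")
    return problems
```

For example, `preflight(Path("artifact"), ["README.md"], ["run_experiments.sh"])` (hypothetical names) returns an empty list only if both files are present and the script is executable. Running such a check inside a fresh copy of the submitted archive, rather than the development checkout, is closer to what a reviewer will see.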
Help others to build upon your contributions!
The Artifact Evaluation process is a service provided by the community to help authors provide useful supplements to their papers so future researchers can build on previous work. Authors of accepted OOPSLA papers are invited to submit an artifact that supports the conclusions of their paper. The AEC will read the paper and explore the artifact to give feedback about how well it supports the paper and how easy it is to use. Submission is voluntary. Papers that go through the Artifact Evaluation process receive a seal of approval. Authors of papers with accepted artifacts are encouraged to make these materials publicly available by including them as “source materials” in the ACM Digital Library.
Call for Artifacts
This process was inspired by the ECOOP 2013 AEC by Jan Vitek, Erik Ernst, and Shriram Krishnamurthi.
The artifact is evaluated in relation to the expectations set by the paper. Thus, in addition to running the artifact, evaluators will read the paper and may try to tweak inputs or otherwise slightly generalize the use of the artifact in order to test the artifact’s limits.
Artifacts should be:
- consistent with the paper,
- as complete as possible,
- well documented, and
- easy to reuse, facilitating further research.
The AEC strives to place itself in the shoes of such future researchers and then to ask: how much would this artifact have helped me?
If your paper makes it past Round 1 of the review process, you may submit an artifact consisting of three pieces:
- an overview,
- a URL pointing to either: a single file (recommended), or the address of a public repository, and
- a hash certifying the version of the artifact at submission time: either an md5 hash of the single file (use the md5 or md5sum command-line tool to generate it), or the full commit hash (e.g., from git reflog --no-abbrev)
The URL must be a Google Drive, Dropbox, Github, Bitbucket, or (public) Gitlab URL.
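For authors without the md5 or md5sum tool at hand, the same digest can be computed with Python's standard hashlib module. This is a small sketch, not an official submission tool; the helper name `md5_of_file` is an assumption made for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks so that
    large archives (e.g., VM images) do not have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string should match the first field printed by md5sum for the same file, so either can be used for the hash recorded at submission time.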
NEW This year we will be collaborating with a non-profit, Accelerate Publishing, whose goal is to better support several aspects of academic publishing, in particular reproducibility via creation of artifacts. During artifact submission, authors will be asked whether they are willing to share their artifact with the non-profit, and permit the non-profit to see discussions of their artifact during the review process. The goal of permitting the non-profit access to these discussions and artifacts is to give them an overview of the sources of problems in the creation of reproducible artifacts, in order to focus efforts to make artifact creation easier, less time consuming, and less error-prone. Authors do not need to share their artifact or PC discussions with Accelerate Publishing, and opting against sharing will not influence the AEC’s decision.
Your overview should consist of two parts:
- a Getting Started Guide and
- Step-by-Step Instructions for how you propose to evaluate your artifact (with appropriate connections to relevant sections of your paper).
The Getting Started Guide should contain setup instructions (including, e.g., a pointer to the VM player software, its version, passwords if needed) and basic testing of your artifact that you expect a reviewer to be able to complete in 30 minutes. Reviewers will follow all the steps in the guide during an initial kick-the-tires phase. The Guide should be as simple as possible. It should stress the key elements of your artifact. Anyone who has followed the Getting Started Guide should have no technical difficulties with the rest of your artifact.
The Step-by-Step Instructions explain how to reproduce any experiments or other activities that support your paper. Write this for readers who have a deep interest in your work and are studying it to improve it or compare against it. If your artifact runs for more than a few minutes, point this out and explain how to run it on smaller inputs. Where appropriate, include descriptions of and links to files that represent expected outputs (e.g., the log files expected to be generated by your tool on the given inputs); if there are warnings that are safe to ignore, explain which ones they are.
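One way to make expected outputs easy for a reviewer to check is to ship a small comparison helper alongside the expected log files. The sketch below is only an illustration under stated assumptions: the function names and the idea of filtering harmless warnings by line prefix are hypothetical choices, not requirements of the AEC.

```python
import difflib
from pathlib import Path
from typing import List, Sequence


def load_lines(path: Path, ignore_prefixes: Sequence[str]) -> List[str]:
    """Read a log file, dropping lines that start with a known-harmless prefix."""
    lines = path.read_text().splitlines()
    return [line for line in lines if not line.startswith(tuple(ignore_prefixes))]


def diff_against_expected(
    actual: Path, expected: Path, ignore_prefixes: Sequence[str] = ()
) -> List[str]:
    """Return unified-diff lines between the expected and actual logs.

    An empty list means the logs match once ignorable lines are removed.
    """
    return list(
        difflib.unified_diff(
            load_lines(expected, ignore_prefixes),
            load_lines(actual, ignore_prefixes),
            fromfile=str(expected),
            tofile=str(actual),
            lineterm="",
        )
    )
```

A reviewer could then run something like `diff_against_expected(Path("out.log"), Path("expected/out.log"), ["WARNING:"])` (hypothetical paths and prefix) and treat an empty result as a successful reproduction of that step. Outputs that legitimately vary between runs, such as timings, would need a looser comparison than an exact diff.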
The artifact’s documentation should include the following:
- A list of claims from the paper supported by the artifact, and how/why.
- A list of claims from the paper not supported by the artifact, and how/why.
Example: Performance claims cannot be reproduced in VM, authors are not allowed to redistribute specific benchmarks, etc. Artifact reviewers can then center their reviews / evaluation around these specific claims.
Packaging the Artifact
When packaging your artifact, please keep in mind: a) how accessible you are making your artifact to other researchers, and b) the fact that the AEC members will have a limited time in which to make an assessment of each artifact.
Your artifact can contain a bootable virtual machine image with all of the necessary libraries installed. Using a virtual machine provides a way to make an easily reproducible environment — it is less susceptible to bit rot. It also helps the AEC have confidence that errors or other problems cannot cause harm to their machines. This is recommended.
Submitting source code that must be compiled is permissible. A more automated and/or portable build, such as a Dockerfile or a build tool that manages all compilation and dependencies (e.g., gradle), improves the odds the AEC will not be stuck getting different versions of packages working (particularly different releases of programming languages).
You should make your artifact available as a single archive file and use the naming convention <paper #>.<suffix>, where the appropriate suffix is used for the given archive format. Please use a widely available compressed archive format such as ZIP (.zip), tar and gzip (.tgz), or tar and bzip2 (.tbz2). Please use open formats for documents.
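As an illustration of the naming convention, the following sketch packs a directory into a gzip-compressed tar archive named after the paper number, using Python's standard tarfile module. The function name `package_artifact` and the paper number in the usage note are hypothetical.

```python
import tarfile
from pathlib import Path


def package_artifact(source_dir: Path, paper_number: int, out_dir: Path) -> Path:
    """Pack `source_dir` into `<paper #>.tgz` inside `out_dir` and return the archive path.

    `out_dir` should lie outside `source_dir` so the archive does not include itself.
    """
    archive = out_dir / f"{paper_number}.tgz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps a single top-level directory in the archive
        # instead of the full path on the author's machine.
        tar.add(str(source_dir), arcname=source_dir.name)
    return archive
```

For instance, `package_artifact(Path("artifact"), 123, Path("dist"))` would produce `dist/123.tgz` for a (hypothetical) paper number 123. Unpacking the result on a clean machine and following the Getting Started Guide from there is a cheap way to confirm nothing was left out.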
Artifacts do not have to be anonymous.
Conflicts of interest for AEC members are handled by the chairs. Conflicts of interest involving one of the two AEC chairs are handled by the other AEC chair, or by the PC of the conference if both chairs are conflicted. To be validated, such artifacts must be unambiguously accepted, and they may not be considered for the distinguished artifact.