N. Demir, M. Große-Kampmann, T. Holz, N. Pohlmann, T. Urban, C. Wressnegger:
“Reproducibility and Replicability of Web Measurement Studies”.
In Proceedings of The Web Conference 2022 (WWW).
Web measurement studies can shed light on not yet fully understood phenomena and thus are essential for analyzing how the modern Web works. This often requires building new and adjusting existing crawling setups, which has led to a wide variety of analysis tools for different (but related) aspects. If these efforts are not sufficiently documented, the reproducibility and replicability of the measurements may suffer—two properties that are crucial to sustainable research. In this paper, we survey 117 recent research papers to derive best practices for Web-based measurement studies and specify criteria that need to be met in practice. When applying these criteria to the surveyed papers, we find that the experimental setup and other aspects essential to reproducing and replicating results are often missing. We underline the criticality of this finding by performing a large-scale Web measurement study on 4.5 million pages with 24 different measurement setups to demonstrate the influence of the individual criteria. Our experiments show that slight differences in the experimental setup directly affect the overall results and must be documented accurately and carefully.
As the Web has grown into an essential part of our day-to-day life, the complexity of the employed Web applications has increased drastically. This development has been accompanied by undesirable practices, such as user tracking [19, 32, 55], fingerprinting [21, 45], or even outright malicious activities, such as XSS attacks. Web measurement studies are an essential tool to understand, identify, and quantify such threats, and they allow us to explore isolated phenomena at a large scale. As the modern Web is highly dynamic and ever-changing, this is an inherently difficult task. To conduct studies across thousands of websites, researchers can partly rely on crawling frameworks such as OpenWPM, but more often,
they have to extend existing work or build new crawlers on their own to adapt to new developments on the Web. This trend, however, raises the question of whether different measurement studies using different frameworks for gathering data are comparable and to what extent experiments can be reproduced or replicated. In particular, in the field of Web-based measurements, ensuring replicability requires a tremendous effort to describe, document, and openly communicate the details of the experimental setup and implementations. However, if the community cannot verify and reenact drawn conclusions, the entire scientific process is at risk of becoming unreliable – something that has unfortunately been observed in different research disciplines in the past [17, 31, 52]. In this work, we systematize such effects, provide best practices and criteria that help design studies, and additionally perform a large-scale Web measurement study that highlights the impact of these subtle differences. In particular, we survey 117 research papers published at top-tier security and privacy venues in the past six years. Based on this survey, we factor out common fundamental principles for Web measurements and establish common guidelines for conducting such experiments. We define criteria that help design experimental setups that are reproducible and replicable.
By applying these criteria to the analyzed papers, we find that the documentation of the experimental setups is often neglected and does not fulfill the community’s expectations of a Web measurement study (see Section 4). In a large-scale study for which we visit 4.5 million pages on over 8,800 sites with 24 browser profiles, we show that slight changes in the experimental setup alter the results to an extent where cross-comparability of studies is not feasible (see Section 4). For example, we find that the number of trackers identified on pages can vary by 25% depending on the browser configuration used.
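To make concrete how quickly a handful of configuration choices multiplies into many distinct measurement setups, the following sketch enumerates the cross product of four hypothetical configuration dimensions. The dimension names and values are illustrative assumptions for this example, not the paper’s actual setup; the point is that every dimension must be documented explicitly for a study to be reproducible.

```python
from itertools import product

# Hypothetical configuration dimensions of a Web measurement study
# (illustrative assumptions, not the setup used in the paper).
browsers = ["chrome", "firefox"]
profiles = ["stateless", "stateful", "logged-in"]
viewports = ["desktop", "mobile"]
locations = ["eu", "us"]

# The cross product yields every distinct measurement setup. Each entry
# is one complete configuration that a crawl could be run under.
setups = [
    {"browser": b, "profile": p, "viewport": v, "location": loc}
    for b, p, v, loc in product(browsers, profiles, viewports, locations)
]

print(len(setups))  # 2 * 3 * 2 * 2 = 24 distinct configurations
```

Even this small example reaches 24 setups, matching the scale of the study’s profile count; omitting any one dimension from a paper’s documentation makes results from different setups silently incomparable.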
In summary, we make the following contributions:
• Guidelines for Web measurements. We highlight the challenges of designing Web measurements and provide guidelines that help set up experiments that effectively address them.
• Prevalence study. We perform a survey of 117 security and privacy papers from 2016–2021 that perform Web measurements and show that our described challenges affect most of them.
• Impact analysis. To increase the comparability of future and previous Web measurements, we perform experiments using 24 measurement setups and compare the differences that emerge from the frameworks used.