C. Rossow, C. Dietrich, C. Kreibich, C. Grier, V. Paxson, N. Pohlmann, H. Bos, M. van Steen:
“Prudent Practices for Designing Malware Experiments: Status Quo and Outlook”.
33rd IEEE Symposium on Security and Privacy,
San Francisco, CA, USA
Malware researchers rely on the observation of malicious code in execution to collect datasets for a wide array of experiments, including generation of detection models, study of longitudinal behavior, and validation of prior research. For such research to reflect prudent science, the work needs to address a number of concerns relating to the correct and representative use of the datasets, presentation of methodology in a fashion sufficiently transparent to enable reproducibility, and due consideration of the need not to harm others. In this paper we study the methodological rigor and prudence in 36 academic publications from 2006–2011 that rely on malware execution; 40% of these papers appeared in the six highest-ranked academic security conferences. We find frequent shortcomings, including problematic assumptions regarding the use of execution-driven datasets (25% of the papers), absence of any description of the security precautions taken during the experiments (71% of the articles), and often an insufficient description of the experimental setup. Deficiencies occur in top-tier venues and elsewhere alike, highlighting a need for the community to improve its handling of malware datasets. In the hope of aiding authors, reviewers, and readers, we frame guidelines regarding transparency, realism, correctness, and safety for collecting and using malware datasets.
Observing the host- or network-level behavior of malware as it executes constitutes an essential technique for researchers seeking to understand malicious code. Dynamic malware analysis systems such as Anubis, CWSandbox, and others [16, 22, 27, 36, 42] have proven invaluable in
generating ground truth characterizations of malware behavior. The anti-malware community regularly applies these ground truths in scientific experiments, for example to evaluate malware detection technologies [2, 10, 17, 19, 24, 26, 30, 33, 44, 48, 52–54], to disseminate the results of large-scale
malware experiments [6, 11, 42], to identify new groups of malware [2, 5, 38, 41], or as training datasets for machine learning approaches [20, 34, 35, 38, 40, 41, 47, 55]. However, while analysis of malware execution clearly holds importance for the community, the data collection and subsequent
analysis processes face numerous potential pitfalls.