2023 IEEE 19th International Conference on e-Science (e-Science)

Abstract

Scientific computing communities increasingly run their experiments using complex data- and compute-intensive workflows that utilize distributed and heterogeneous architectures targeting numerical simulations and machine learning, often executed on the Department of Energy Leadership Computing Facilities (LCFs). We argue that a principled, systematic approach to implementing FAIR principles at scale, including fine-grained metadata extraction and organization, can help address the numerous challenges to performance reproducibility posed by such workflows. We extract workflow patterns, propose a set of tools to manage the entire life cycle of performance metadata, and aggregate these tools into an HPC-ready framework for reproducibility (RECUP). We describe the challenges in making these tools interoperable, our preliminary work, and lessons learned from this experiment.
