2023 IEEE 19th International Conference on e-Science (e-Science)

Abstract

In this paper, we summarize our effort to create and use an integrated framework that coordinates computational AI analytics tasks with the help of a task and experiment management workflow system. Our design follows a minimalistic approach while allowing access to hybrid computational resources offered through the user's own computer, HPC computing centers, cloud resources, and distributed systems in general. The framework can be accessed through a GUI for monitoring and managing the workflow, a REST service, a command-line interface, and a Python interface. It uses a template-based batch management system that, through configuration files, easily generates reproducible experiments while creating permutations over selected experiment parameters, as is typical in deep learning applications. The resulting framework was developed for analytics workflows targeting MLCommons benchmarks of AI applications on hybrid computing resources, and it also serves as an educational tool for teaching scientists and students sophisticated concepts for executing computations on resources ranging from a single computer to many thousands of computers as part of on-premise and cloud infrastructure. We demonstrate the usefulness of the tool by generating FAIR-principle-based application accuracy benchmarks for the MLCommons Science Working Group Cloudmask application. The code is available as an open-source project on GitHub and is based on an easy-to-enhance framework called Cloudmesh, so it can readily be applied to other applications.
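The template-based, configuration-driven experiment permutation described above can be illustrated with a minimal sketch. The configuration layout, parameter names, and helper function below are hypothetical and stand in for the general idea (a Cartesian product over selected parameters expanded into per-experiment batch scripts); they are not the actual Cloudmesh API.

    # Minimal sketch (hypothetical, not the actual Cloudmesh API): expand a
    # configuration with parameter lists into one rendered batch script per
    # permutation, the general idea behind template-based experiment generation.
    from itertools import product
    from string import Template

    # Hypothetical experiment configuration; in practice this would be read
    # from a configuration file (e.g. YAML).
    config = {
        "experiment": {
            "learning_rate": [0.001, 0.0001],
            "epochs": [10, 50],
        },
        "template": "python train.py --lr=${learning_rate} --epochs=${epochs}\n",
    }

    def expand_permutations(cfg):
        """Yield (parameters, rendered script) for every parameter permutation."""
        names = list(cfg["experiment"])
        values = [cfg["experiment"][n] for n in names]
        for combo in product(*values):
            params = dict(zip(names, combo))
            yield params, Template(cfg["template"]).substitute(params)

    for params, script in expand_permutations(config):
        print(params, "->", script.strip())

Under this sketch, each rendered script could then be handed to a local shell, an HPC batch scheduler, or a cloud resource, which is how a single configuration file can yield reproducible runs across hybrid infrastructure.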