Abstract
The increasing complexity of machine learning (ML) and artificial intelligence (AI) applications necessitates efficient GPU resource management in distributed environments such as Kubernetes. Conventional one-to-one GPU mapping, which allocates a single GPU to a single container, often leaves these critical resources underutilized. Our study introduces an approach that leverages KubeRay and time-slicing to enable dynamic GPU sharing among multiple concurrent workloads, significantly improving memory utilization and overall response times. Our findings show that while memory efficiency is notably enhanced, the proposed method incurs longer task completion times due to the overhead of managing distributed tasks. Specifically, task completion times increased by approximately 74.43% on average with two parallel workloads and by approximately 158.4% with three parallel workloads. These results expose the trade-off between improved resource utilization and execution time, highlighting the need for future research to optimize these mechanisms in Kubernetes-based ML operations.