Abstract
The rapid growth of scientific software development has led to the emergence of large and complex codebases, making it challenging to search, find, and compare software repositories within the scientific research community. In this paper, we propose a solution by leveraging deep learning techniques to learn embeddings that capture semantic similarities among repositories. Our approach focuses on identifying repositories with similar semantics, even when their code fragments and documentation exhibit different syntax. To address this challenge, we introduce two complementary open-source tools: RepoSim and RepoSnipy. RepoSim is a command-line toolbox designed to represent repositories at both the source code and documentation levels. It utilizes the UniXcoder pre-trained language model, which has demonstrated remarkable performance in code-related understanding tasks. RepoSnipy is a web-based neural semantic search engine that utilizes the powerful capabilities of RepoSim and offers a user-friendly search interface, allowing researchers and practitioners to query public repositories hosted on GitHub and discover semantically similar repositories. RepoSim and RepoSnipy empower researchers, developers, and practitioners by facilitating the comparison and analysis of software repositories. They not only enable efficient collaboration and code reuse but also accelerate the development of scientific software.