Abstract
Software reverse analysis is a key technology in cybersecurity, and it faces growing challenges as software increases in scale and complexity. Binary code modularization (BCM), a foundational task in software reverse analysis, plays an important role in extracting semantics, narrowing the scope of analysis, and locating key positions in the code. The semantics of the strings embedded in binary code are a significant hint: when processed with natural language processing and artificial intelligence techniques, they can aid reverse analysis effectively. However, most existing modularization methods ignore these semantics, which limits in-depth understanding of binary code. This paper proposes a semantic-driven reverse engineering framework for binary code modularization (SBCM). First, the rich strings in the binary file are extracted and passed to a large language model for semantic analysis. Then, the semantic information of the strings is combined with the control flow graph to construct a function semantic graph (FSG). Subsequently, a summary of each function is generated from the FSG. Finally, semantic embeddings are generated for the summaries, and semantic-driven integrated clustering is carried out to modularize the binary code. Experimental results show that SBCM improves the F1 score by 12.6% on average over existing methods, demonstrating its effectiveness and superiority in binary code modularization.