Abstract
Transformer-based models have demonstrated substantial potential in medical image segmentation tasks due to their exceptional ability to capture long-range dependencies. To further enhance segmentation performance, various effective methods have been proposed, including pretraining methods (weakly supervised or self-supervised pretraining schemes), contrastive learning schemes, and knowledge distillation methods. However, segmenting esophageal cancer (EC) from CT images remains a significant challenge, partly due to the complex anatomy of EC, including variable shapes, extensive extents, and often blurred boundaries with adjacent anatomical structures. In this study, we propose a prior-guided pretraining (PGP) scheme based on bounding boxes, which enhances the model’s ability to discern textural differences between EC and the surrounding tissues. Using Swin UNETR as the backbone, the proposed pretraining scheme achieves superior performance in EC segmentation compared with alternative schemes. We also address the class imbalance and long-tail problems inherent in EC segmentation, further improving segmentation accuracy.