Abstract
Deep learning models thrive on large amounts of data in which classes are usually well balanced. In medical imaging, however, we often encounter the opposite situation. Wireless Capsule Endoscopy is no exception: even though huge amounts of raw data can be collected, labeling every frame of a single video can take an expert physician up to twelve hours. Moreover, most videos show no pathology at all, while the minority that do contain only a few pathological frames. The result is a scarcity of labeled data and a severe class imbalance. Self-supervised learning provides a means of using unlabeled data to initialize models that perform better even under these circumstances. We propose a novel contrastive loss derived from Triplet Loss, crafted to leverage the temporal information in endoscopy videos. We show that our model outperforms existing models and other contrastive methods on several tasks.
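The proposed loss is defined in the body of the paper; as a rough illustration of the underlying idea, the following is a minimal sketch of a triplet-style temporal contrastive loss, assuming that temporally adjacent frames of a video act as positives and temporally distant frames as negatives. The function name `temporal_triplet_loss`, the `margin` value, and the sampling scheme are illustrative assumptions, not the paper's actual formulation.

```python
# Sketch only: a standard triplet loss applied to frame embeddings,
# where positives/negatives are chosen by temporal proximity (assumed,
# not the paper's definition).
import torch
import torch.nn.functional as F

def temporal_triplet_loss(anchor: torch.Tensor,
                          positive: torch.Tensor,
                          negative: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """Triplet loss over video-frame embeddings.

    anchor:   embeddings of frames at time t           (B, D)
    positive: embeddings of temporally close frames    (B, D)
    negative: embeddings of temporally distant frames  (B, D)
    """
    d_pos = F.pairwise_distance(anchor, positive)  # pull temporal neighbors together
    d_neg = F.pairwise_distance(anchor, negative)  # push distant frames apart
    return F.relu(d_pos - d_neg + margin).mean()

# Usage: embed three batches of frames with a shared encoder `enc`
# (hypothetical), then minimize the loss so that temporal neighbors
# end up close in embedding space:
#   loss = temporal_triplet_loss(enc(x_t), enc(x_t_near), enc(x_t_far))
```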