Abstract
Patients and healthcare providers often face communication barriers due to different languages, which can impede healthcare quality. Moreover, with digital medicine and health applications, health data in different languages becomes increasingly available and connected. Machine translation services may be suitable for bridging these language barriers. This paper aims to assess the translation quality of different machine translation services in a biomedical context. We compare the European Commission’s eTranslation service, DeepL, Google Translate, and the Watson Language Translator, using an automated, extensible, and open-source pipeline. Parallel biomedical corpora are identified and used to translate German, French, and Spanish sentences into English. We apply the evaluation metrics BLEU, ROUGE, and BLEURT, and additionally rank the translations in a human validation. Both the evaluation metrics and the human validation indicate higher translation quality for DeepL and Google Translate than for the Watson Language Translator and eTranslation. The human raters also identified more errors in the eTranslation and Watson Language Translator translations. The BLEURT metric reaches the highest agreement with the human validation. However, the effect sizes of the metric differences are small, and the raters agree that the translations were generally reasonable. Although the translations are mostly understandable, in practice users should always be made aware that translations were generated automatically and may contain errors. Moreover, when choosing a translation service for a medical application, features other than translation quality, such as speed and data security, should also be considered. In future versions, the proposed pipeline could be extended to include updated versions of translation services, large language models, further languages, and additional evaluation metrics.
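For orientation only, the following minimal sketch (not the paper's actual pipeline code) shows how sentence-level BLEU and ROUGE-L scores of the kind used here could be computed with the sacrebleu and rouge-score libraries; the example sentences are hypothetical, and BLEURT is omitted because it requires downloading a learned checkpoint.

```python
# Illustrative sketch: score one machine-translated sentence against its
# English reference with BLEU and ROUGE-L (assumed libraries: sacrebleu, rouge-score).
import sacrebleu
from rouge_score import rouge_scorer

reference = "The patient reports persistent headaches and dizziness."   # hypothetical gold sentence
hypothesis = "The patient reports ongoing headaches and dizziness."     # hypothetical MT output

# Sentence-level BLEU via sacrebleu (reported on a 0-100 scale).
bleu = sacrebleu.sentence_bleu(hypothesis, [reference])

# ROUGE-L F-measure via the rouge-score package (reported on a 0-1 scale).
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, hypothesis)["rougeL"].fmeasure

print(f"BLEU: {bleu.score:.1f}  ROUGE-L F1: {rouge_l:.3f}")
```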