2021 IEEE 17th International Conference on eScience (eScience)

Abstract

Scientific publications constitute an extremely valuable repository of knowledge and facts crucial to the advancement of science and the development of applications. This repository grows as researchers learn from previous work and use results in the literature to design and create. With the exponential growth of available publications, reading them and extracting this wealth of information has become impractical for humans. Despite great progress in natural language processing, machine-learned solutions require large amounts of carefully annotated data to perform well, especially when labeling and extracting complex scientific data. Towards our ultimate goal of extracting scientific facts from the literature, we first aim to identify the passages of text that contain all of the facts in a publication, so that they can later be automatically extracted or scrutinized by experts. Our previous work identified some facts missed by experts but missed others because it assumed that the target relation (here, a polymer and its glass transition temperature) would be contained within a single sentence. We set out to enhance our approximate labeling system to look back and ahead for missing information, and achieved 100% recall of scientific facts while reducing the full-text publication to 6% of its original size. Moreover, we assign confidence scores to sentences to further assist expert curators in identifying important sentences and facts locked in unstructured text.
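
The look-back and look-ahead idea can be illustrated with a minimal sketch. It is not the labeling system described in the paper: the regular expressions, the `window` parameter, the function name `select_sentences`, and the distance-based score are all assumptions made for illustration. A sentence is kept if it mentions both a polymer and a glass transition, or if it mentions one of them and the other appears within `window` sentences before or after it.

```python
import re

# Hypothetical patterns for illustration only; the paper does not list its rules.
POLYMER = re.compile(r"\b(?:polystyrene|polyethylene|PMMA)\b", re.IGNORECASE)
TG = re.compile(r"glass transition|\bTg\b", re.IGNORECASE)


def select_sentences(sentences, window=1):
    """Return {sentence_index: confidence} for sentences likely to carry a fact.

    Assumed scoring: 1.0 when both mentions share a sentence, otherwise
    1 / (1 + distance) to the nearest sentence holding the missing half.
    """
    has_poly = [bool(POLYMER.search(s)) for s in sentences]
    has_tg = [bool(TG.search(s)) for s in sentences]
    scores = {}
    for i in range(len(sentences)):
        if has_poly[i] and has_tg[i]:
            scores[i] = 1.0
            continue
        if not (has_poly[i] or has_tg[i]):
            continue
        # Look back and ahead for the half of the relation that is missing here.
        other = has_tg if has_poly[i] else has_poly
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        dists = [abs(j - i) for j in range(lo, hi) if j != i and other[j]]
        if dists:
            scores[i] = 1.0 / (1 + min(dists))
    return scores


sentences = [
    "Polystyrene films were spin cast.",
    "The glass transition was measured at 100 C.",
    "Samples were stored in air.",
    "We thank the funding agency.",
]
selected = select_sentences(sentences, window=1)
```

In this toy input the first two sentences are kept as a pair even though neither contains the full relation on its own, while the last two are dropped. Setting `window=0` reduces the sketch to the same-sentence assumption of the earlier approach.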
