Abstract
Scientists in every branch of science, and especially in biology, constantly generate an enormous amount of information through their research. These research outcomes are reported in journal and conference articles. For example, PubMed currently stores millions of abstracts and is growing at a rapid pace. Given such a large repository, one of the challenges for any biologist is to find the articles that are likely to contain the specific information that he or she is looking for. A computational tool that can produce a short list of papers likely to contain the information of interest would be of great use to any scientist. In this paper we present generic computational techniques that can be used to build such tools. A typical tool that we envision takes as input a set of keywords (that characterize the information of interest) and develops a learner capable of classifying papers into two types: a Type 1 paper contains information of interest, and a Type 2 paper does not. Similar tools have been reported in the literature. An example is the TextMine algorithm of [11]. For each PubMed paper, TextMine computes the likelihood that the paper contains information on minimotifs and assigns the paper a score accordingly. Papers with a score above a threshold are output for biologists to read manually. TextMine has proven to be a very valuable tool for enhancing the minimotif database of the MnM system [12] [13]. We show that our algorithms yield better results than TextMine.
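
To make the score-and-threshold idea concrete, the following Python sketch assigns each abstract a score and labels it Type 1 when the score exceeds a threshold. It is only an illustration under stated assumptions: it is neither the TextMine scoring function nor the learner developed in this paper. The function names `keyword_score` and `shortlist`, the fraction-of-matched-keywords score, and the example threshold are all hypothetical choices.

```python
import re
from typing import Iterable, List, Tuple


def keyword_score(text: str, keywords: Iterable[str]) -> float:
    """Return the fraction of keywords that occur in the text.

    Assumption: a simple case-insensitive whole-word match stands in for
    whatever likelihood a real tool would compute.
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    kws = [k.lower() for k in keywords]
    if not kws:
        return 0.0
    hits = sum(1 for k in kws if k in tokens)
    return hits / len(kws)


def shortlist(
    papers: List[Tuple[str, str]],
    keywords: List[str],
    threshold: float,
) -> List[str]:
    """Return the IDs of papers scoring strictly above the threshold (Type 1).

    Each paper is a hypothetical (paper_id, abstract_text) pair; papers at or
    below the threshold are treated as Type 2 and left out.
    """
    return [
        paper_id
        for paper_id, text in papers
        if keyword_score(text, keywords) > threshold
    ]


# Illustrative input only; these are made-up abstracts, not PubMed records.
papers = [
    ("p1", "A short peptide motif binds the SH3 domain."),
    ("p2", "Climate trends in coastal rainfall."),
]
keywords = ["motif", "peptide", "binds", "domain"]
type1_ids = shortlist(papers, keywords, threshold=0.5)
```

A learned classifier would replace the fixed keyword fraction with a model trained on labeled papers, but the interface (keywords in, a short list of candidate papers out) stays the same.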