Abstract
Experiment-based identification of catalytic residues in enzyme sequences is the most accurate approach. However, current experimental methods are often too expensive and labor intensive to handle the rapidly accumulating protein sequence and structure data. Thus, accurate, high-throughput in silico methods for identifying catalytic residues and predicting enzyme function are much needed. In this paper, we propose a new, sequence-based enzyme catalytic domain prediction method that uses clustering and information-theoretic approaches. The first step is a sequence clustering analysis of enzyme sequences of the same functional category (those with the same EC number). The clustering analysis constructs a sequence graph in which nodes are enzyme sequences and an edge joins each pair of sequences with a certain degree of sequence similarity, and it uses graph properties such as biconnected components and articulation points to generate sequence segments common to the enzyme sequences. The amino acid subsequences in these shared regions are then aligned, and an information-theoretic approach called the aggregated column related scoring scheme is applied to highlight potential active sites in the enzyme sequences. This method was successful in highlighting known catalytic sites in E. coli enzymes, as annotated in the Catalytic Site Atlas database. The proposed method is shown to be not only accurate in predicting potential active sites in enzyme sequences but also computationally efficient: the clustering approach relies on two graph properties that can be computed in time linear in the number of edges in the sequence graph, and the mutual information computation is inexpensive. We believe that the proposed method can be useful for identifying active sites in enzyme sequences from many genome projects.
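The two computational ingredients named above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the toy sequence identifiers (`s1` to `s5`), and the toy alignment columns are hypothetical, and the mutual information function computes only the basic pairwise quantity between two alignment columns, not the aggregated column related scoring scheme itself. Articulation points are found with the standard depth-first low-link method, which runs in time linear in the number of nodes and edges.

```python
from collections import Counter, defaultdict
from math import log2


def articulation_points(edges):
    """Return the articulation points of an undirected graph given as edge pairs.

    Standard DFS low-link method, O(V + E). Recursive, so it is only
    suitable for small illustrative graphs.
    """
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    disc, low, cut = {}, {}, set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root u is a cut vertex if v's subtree cannot bypass it.
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)
        # The DFS root is a cut vertex only if it has several DFS children.
        if parent is None and children > 1:
            cut.add(u)

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return cut


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equal-length alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum((c / n) * log2(c * n / (pa[a] * pb[b]))
               for (a, b), c in pab.items())


# Hypothetical similarity graph: two triangles of sequences sharing s3.
edges = [("s1", "s2"), ("s2", "s3"), ("s1", "s3"),
         ("s3", "s4"), ("s4", "s5"), ("s3", "s5")]
print(articulation_points(edges))          # {'s3'}

# Toy columns: residues read down one alignment column each.
print(mutual_information("AAGG", "CCTT"))  # 1.0 (perfectly co-varying)
print(mutual_information("AAGG", "CTCT"))  # 0.0 (independent)
```

In the toy graph, `s3` is the only sequence whose removal disconnects the graph, so it joins two biconnected components. The similarity threshold that defines an edge is not modeled here and would have to be chosen separately.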