Abstract
Sequential patterns are used to discover knowledge in a wide range of applications. In many scenarios, however, pattern quality is low because the patterns are short or have low support. Furthermore, for dense datasets such as protein sequences, most sequential pattern mining algorithms return an extremely large number of patterns, which are difficult to process and analyze. By relaxing the definition of frequency to allow some mismatches, it is possible to discover higher-quality patterns. We call these patterns Frequent Approximate Substrings, or FAS-patterns, and we introduce an algorithm, FAS-Miner, to handle the mining task efficiently. Experiments on real-world protein and DNA datasets show that FAS-Miner discovers patterns of much greater length and higher support than standard sequential pattern mining approaches.
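
To make the idea of relaxed frequency concrete, the following minimal sketch counts the support of a candidate substring when a bounded number of mismatches is tolerated. It is an illustration only, not FAS-Miner itself: the abstract does not fix the mismatch model, so the sketch assumes Hamming distance over equal-length windows and defines support as the number of sequences with at least one matching window. The names `hamming`, `approx_support`, and `max_mismatches` are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def approx_support(pattern: str, sequences: list[str], max_mismatches: int) -> int:
    """Count sequences with at least one window within max_mismatches of pattern.

    Assumption: mismatches are substitutions only (Hamming distance);
    insertions and deletions are not considered in this sketch.
    """
    m = len(pattern)
    count = 0
    for seq in sequences:
        # Slide a window of the pattern's length over the sequence.
        for i in range(len(seq) - m + 1):
            if hamming(pattern, seq[i:i + m]) <= max_mismatches:
                count += 1
                break  # one matching window is enough for this sequence
    return count


# Toy DNA-like data, invented purely for illustration.
sequences = ["ACGTACGT", "ACGAACGT", "TTGTACCT", "GGGGGGGG"]
exact = approx_support("GTAC", sequences, 0)    # exact matching
relaxed = approx_support("GTAC", sequences, 1)  # one mismatch allowed
print(exact, relaxed)
```

On this toy input, allowing a single mismatch raises the support of `GTAC` because the second sequence contains `GAAC`, which differs from the pattern in one position. This is the effect the relaxed definition relies on: a pattern that falls below a support threshold under exact matching can clear it once near-occurrences are counted.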