SQL-based Aggregation for Text Mining

Text mining makes it possible to discover patterns and trends semi-automatically in huge collections of unstructured text. We developed MedTAKMI to facilitate knowledge discovery from the very large text databases characteristic of life science and healthcare applications. MedTAKMI can mine a huge document collection interactively and computes each of its functions quickly. However, because MedTAKMI is implemented in C++ and relies on a proprietary index (a modified DTM) and a proprietary aggregation engine, it is not easy to extend with new functions or to integrate with other systems. In this paper, we propose an SQL-based method that stores annotated words in a relational database and computes each function of MedTAKMI in SQL. Although the original MedTAKMI was implemented in a few thousand lines of C++ code, the proposed method is implemented in a few lines of SQL and is comparable to the original MedTAKMI.
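
As a rough illustration of the idea of keeping annotated words in a relational table and aggregating them with SQL, the sketch below uses Python's built-in sqlite3 module. The table name `annotation`, its columns, the helper functions, and the toy rows are all assumptions made for this example; they are not taken from MedTAKMI or from the report itself, and the actual schema and queries may differ.

```python
import sqlite3

# Hypothetical schema (an assumption, not the report's): one row per
# (document, annotation category, keyword) triple.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation (doc_id INTEGER, category TEXT, keyword TEXT)"
)

# Invented toy rows, for illustration only.
rows = [
    (1, "gene", "geneA"), (1, "disease", "diseaseX"),
    (2, "gene", "geneA"), (2, "disease", "diseaseY"),
    (3, "gene", "geneB"), (3, "disease", "diseaseX"),
    (4, "gene", "geneA"), (4, "disease", "diseaseX"),
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)


def keyword_counts(conn, category):
    """Count the distinct documents that mention each keyword of a category."""
    sql = """
        SELECT keyword, COUNT(DISTINCT doc_id) AS n_docs
        FROM annotation
        WHERE category = ?
        GROUP BY keyword
        ORDER BY n_docs DESC, keyword
    """
    return conn.execute(sql, (category,)).fetchall()


def cooccurrence(conn, cat_a, cat_b):
    """Count documents in which a keyword of cat_a and one of cat_b co-occur."""
    sql = """
        SELECT a.keyword, b.keyword, COUNT(DISTINCT a.doc_id) AS n_docs
        FROM annotation AS a
        JOIN annotation AS b ON a.doc_id = b.doc_id
        WHERE a.category = ? AND b.category = ?
        GROUP BY a.keyword, b.keyword
        ORDER BY n_docs DESC, a.keyword, b.keyword
    """
    return conn.execute(sql, (cat_a, cat_b)).fetchall()


if __name__ == "__main__":
    print(keyword_counts(conn, "gene"))
    print(cooccurrence(conn, "gene", "disease"))
```

Each aggregation here is a single GROUP BY query, which is the kind of compactness the abstract refers to when it contrasts a few lines of SQL with a few thousand lines of C++.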

By: Akihiro Inokuchi and Kohichi Takeda

Published in: RT0634 in 2007


