A Text Mining Approach to Confidential Document Detection for Data Loss Prevention

Data loss prevention (DLP) systems aim to automatically detect and protect confidential or sensitive information in an organization, for example when it is accidentally leaked by email. Current state-of-the-art DLP systems employ rudimentary content analysis techniques, such as regular expression matching, to detect sensitive content. The detection accuracy of these approaches remains very limited for unstructured text because of the high level of ambiguity and idiosyncrasy in human language. In this paper, we propose problem-specific text mining techniques to assess the sensitivity of documents. Our case study on a corpus of more than 900 confidential documents shows that a lightweight classifier with problem-specific features outperforms existing methods by at least 10 percentage points.

By: Youngja Park

Published in: RC25055 in 2010


