A Text Mining Approach to Confidential Document Detection for Data Loss Prevention

Data loss prevention (DLP) systems aim to automatically detect and protect confidential or sensitive information in an organization, for example when it is accidentally leaked by email. Current state-of-the-art DLP systems employ rudimentary content analysis techniques, such as regular expression matching, to detect sensitive content. The detection accuracy of these approaches remains very limited for unstructured text, due to the high level of ambiguity and idiosyncrasy in human languages. In this paper, we propose problem-specific text mining techniques to assess the sensitivity of documents. Our case study on a corpus of more than 900 confidential documents shows that a lightweight classifier with problem-specific features outperforms existing methods by at least 10 percentage points.

By: Youngja Park

Published in: RC25055 in 2010


This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

