An Automatic Method to Extract Data from an Electronic Contract Composed of a Number of Documents in PDF Format

An electronic contract can encompass a large number of collateral contract documents in PDF format. These contract documents are of different contract document types and are converted from different original formats. Data extraction, and thus data mining, for this kind of electronic contract is very difficult. In this paper, we present a novel method to automatically extract contract data from this kind of electronic contract. Our automatic electronic contract data extraction system comprises an administrator module, a PDF parser, a pattern recognition engine and a contract data extraction engine. The administrator module provides templates for inputting document patterns and a list of contract data tags for each contract document type. It also constructs the pattern matrices and stores them in a database. The PDF parser converts the contract PDF document into a contract text document, inserting formatting bookmarks that mark, for example, a new page, paragraph or line. The pattern recognition engine determines the list of contract document types in the electronic contract by comparing and matching the patterns of all known contract document types with the pattern of the contract text document. The contract data extraction engine retrieves the corresponding list of contract data tags and then extracts contract data accordingly for each contract document type on the list. Our automatic electronic contract data extraction system has been found to be very accurate, efficient and useful in extracting contract data for data mining.
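The recognition and extraction stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the page bookmark string, the document type names, the tag names and the use of regular expressions in place of pattern matrices are all assumptions made for illustration, and the sample text stands in for the output of a PDF parser.

```python
import re

# Hypothetical stand-in for the stored patterns: each document type has one
# identifying pattern and a list of contract data tags with extraction rules.
DOCUMENT_TYPES = {
    "master_agreement": {
        "pattern": re.compile(r"MASTER AGREEMENT", re.IGNORECASE),
        "tags": {
            "agreement_number": re.compile(r"Agreement No\.\s*(\S+)"),
            "customer": re.compile(r"Customer:\s*(.+)"),
        },
    },
    "statement_of_work": {
        "pattern": re.compile(r"STATEMENT OF WORK", re.IGNORECASE),
        "tags": {
            "start_date": re.compile(r"Start Date:\s*(\d{4}-\d{2}-\d{2})"),
        },
    },
}

# Assumed formatting bookmark that a parser might insert at each new page.
PAGE_MARK = "<<PAGE>>"


def recognize_and_extract(contract_text):
    """Return a list of (document_type, extracted_data) pairs, one per matched page."""
    results = []
    for page in contract_text.split(PAGE_MARK):
        for doc_type, spec in DOCUMENT_TYPES.items():
            # Recognition: the first known type whose pattern occurs on the page wins.
            if spec["pattern"].search(page):
                data = {}
                # Extraction: apply only the tags registered for that type.
                for tag, rule in spec["tags"].items():
                    match = rule.search(page)
                    if match:
                        data[tag] = match.group(1).strip()
                results.append((doc_type, data))
                break
    return results


# Invented sample of bookmarked contract text.
SAMPLE = (
    "MASTER AGREEMENT\nAgreement No. MA-001\nCustomer: Example Corp\n"
    "<<PAGE>>"
    "STATEMENT OF WORK\nStart Date: 2006-01-15\n"
)

if __name__ == "__main__":
    for doc_type, data in recognize_and_extract(SAMPLE):
        print(doc_type, data)
```

Keeping the tag rules under each document type mirrors the abstract's ordering, in which the document type is recognized first and only then is its list of contract data tags retrieved and applied.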

By: Thomas Y. Kwok; Thao Nguyen

Published in: Proceedings of the 8th IEEE International Conference on e-Commerce Technology (CEC 2006), IEEE Computer Society, pp. 258-62, 2006


