An EM Algorithm for History-Based Statistical Parsers

This paper aims to improve statistical parsing by making use of partially labeled data from a different domain. The labeled part of a parse tree is treated as the “observation” and the unlabeled part as missing information. The expectation-maximization (EM) algorithm is used to infer the missing information, and nested parser states are used to implement the E-step efficiently. We train a series of models on the UPenn Chinese Treebank and use a POS-tagged corpus from Peking University (PKU) as the EM learning data. When the seed model is under-trained, we observe a reduction in parsing error (measured by equal-weighted label F-measure) of up to about one-third. As expected, the usefulness of the PKU data decreases as the seed model is trained with more labeled data.
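The general idea of treating labeled material as observed and the rest as missing can be illustrated with a minimal sketch. This is not the report's history-based parser or its nested-parser-state E-step; it is a toy stand-in that assumes a naive-Bayes-style model with one hidden label per item, and the names `posterior`, `em_partially_labeled`, `seed`, and `partial`, along with the example data, are hypothetical.

```python
def posterior(model, x, labels):
    """Posterior p(z | x) over the missing label z under the current model."""
    p_z, p_x = model
    joint = {}
    for z in labels:
        p = p_z[z]
        for i, v in enumerate(x):
            p *= p_x[i][z][v]
        joint[z] = p
    norm = sum(joint.values())
    return {z: p / norm for z, p in joint.items()}


def em_partially_labeled(seed, partial, labels, n_iter=20, smooth=0.1):
    """Toy EM: `seed` holds (label, features) pairs, `partial` holds features only.

    Assumed model: p(z) * prod_i p(x_i | z). The `smooth` pseudo-counts are an
    assumption added to keep every probability non-zero.
    """
    n_feats = len(seed[0][1])
    values = [sorted({x[i] for _, x in seed} | {x[i] for x in partial})
              for i in range(n_feats)]

    def estimate(weighted):
        # M-step: relative frequencies from (possibly fractional) counts.
        prior = {z: smooth for z in labels}
        cond = [{z: {v: smooth for v in values[i]} for z in labels}
                for i in range(n_feats)]
        for z, x, w in weighted:
            prior[z] += w
            for i, v in enumerate(x):
                cond[i][z][v] += w
        total = sum(prior.values())
        p_z = {z: c / total for z, c in prior.items()}
        p_x = [{z: {v: c / sum(cond[i][z].values())
                    for v, c in cond[i][z].items()}
                for z in labels} for i in range(n_feats)]
        return p_z, p_x

    # Seed model: trained on the fully labeled data only.
    observed = [(z, x, 1.0) for z, x in seed]
    model = estimate(observed)
    for _ in range(n_iter):
        # E-step: spread each partially labeled item over all labels,
        # weighted by the posterior under the current model.
        expected = [(z, x, w) for x in partial
                    for z, w in posterior(model, x, labels).items()]
        # M-step: re-estimate from observed plus expected counts.
        model = estimate(observed + expected)
    return model


# Hypothetical data: features are two POS tags, the hidden label is a phrase type.
labels = ["NP", "VP"]
seed = [("NP", ("NN", "NN")), ("VP", ("VV", "NN"))]
partial = [("NN", "NN"), ("VV", "NN"), ("VV", "NN")]
model = em_partially_labeled(seed, partial, labels)
post = posterior(model, ("VV", "NN"), labels)
print(post)
```

In this sketch the seed counts stay fixed across iterations and only the fractional counts from the partially labeled items change, which mirrors the role of the seed model and the EM learning data described above.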

By: Xiaoqiang Luo, Min Tang, Salim Roukos, Todd Ward

Published in: RC23121 in 2004


This Research Report is available. This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

