Syntax Augmented Machine Translation via Chart Parsing
Latest version: HERE
This is a Hadoop-based, MapReduce-parallelized version (see our IWSLT'08 paper). Check the file setup-commands.txt for installation instructions. The readme file still refers to the old non-Hadoop SAMT version; for usage instructions, read the "Grammar based statistical MT on Hadoop" paper instead.
Our open-source SAMT system consists of three parts:
- Extraction of statistical translation rules from a training corpus: either plain hierarchical rules a la Chiang (2005) or syntax-augmented rules a la Zollmann & Venugopal (2006).
- A CKY+ (Chappelier and Rajman, 1998) style chart parser that employs the statistical translation rules to translate test sentences:
  - Fast C++ code - translates the 2000 (realtest) sentences of the Europarl French-English data in approx. 40 min, i.e., 46 sentences per minute, achieving state-of-the-art scores
  - Implements CKY+ for internal binarization during parsing
  - Can efficiently handle thousands of non-terminal categories
  - Performs LM intersection with the grammar at run-time, or optionally uses future cost estimates for LM cost, producing state-of-the-art scores
- A minimum-error-rate optimization and scoring tool (integrated into the chart parser) to tune the parameters of the underlying log-linear model on a held-out development corpus.
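To make the chart-parsing idea concrete, here is a minimal sketch in Python of translating with hierarchical rules over a span chart. It is not SAMT code and not the CKY+ algorithm itself: it is a plain bottom-up, CKY-style loop with no binarization, no language model, and no non-terminal categories. The toy grammar in RULES, the GAP marker, the single cost per rule (standing in for the weighted feature sum of a log-linear model), and the function names apply_rule and translate are all assumptions made for illustration.

```python
GAP = "X"  # marks the single nonterminal gap allowed in a toy rule

# Hypothetical toy rules: (source side, target side, cost). Not taken from SAMT.
RULES = [
    (("je",), ("I",), 0.5),
    (("sais",), ("know",), 0.25),
    (("ne", GAP, "pas"), ("do", "not", GAP), 1.0),
]


def apply_rule(src, tgt, cost, words, i, j, chart):
    """Return (cost, target words) if the rule covers words[i:j], else None."""
    span = tuple(words[i:j])
    if GAP not in src:
        return (cost, tgt) if span == src else None
    p = src.index(GAP)
    prefix, suffix = src[:p], src[p + 1:]
    gap_start, gap_end = i + len(prefix), j - len(suffix)
    if gap_end <= gap_start:
        return None  # the gap must cover at least one word
    if span[:len(prefix)] != prefix or span[len(span) - len(suffix):] != suffix:
        return None  # terminals around the gap do not match
    sub = chart.get((gap_start, gap_end))
    if sub is None:
        return None  # no derivation exists for the gap span
    sub_cost, sub_tgt = sub
    q = tgt.index(GAP)
    # Substitute the best translation of the gap span into the target side.
    return (cost + sub_cost, tgt[:q] + sub_tgt + tgt[q + 1:])


def translate(words):
    """Fill the chart from narrow to wide spans; return the best (cost, target) or None."""
    n = len(words)
    chart = {}  # (i, j) -> cheapest (cost, target words) covering words[i:j]
    for width in range(1, n + 1):
        for i in range(n - width + 1):
            j = i + width
            candidates = []
            for src, tgt, cost in RULES:
                cand = apply_rule(src, tgt, cost, words, i, j, chart)
                if cand is not None:
                    candidates.append(cand)
            # Glue step: monotone concatenation of two adjacent covered spans.
            for k in range(i + 1, j):
                if (i, k) in chart and (k, j) in chart:
                    left, right = chart[(i, k)], chart[(k, j)]
                    candidates.append((left[0] + right[0], left[1] + right[1]))
            if candidates:
                chart[(i, j)] = min(candidates, key=lambda c: c[0])
    return chart.get((0, n))


if __name__ == "__main__":
    print(translate("je ne sais pas".split()))
    # prints (1.75, ('I', 'do', 'not', 'know'))
```

Keeping only the cheapest entry per span is the simplest possible pruning. A real decoder that intersects the grammar with a language model has to keep several entries per span, because the best choice then depends on the neighboring target words.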
The system is available open-source under the GNU General Public License. Click here to download it. (Library LGPL version [needed if used for commercial purposes, no support provided]: here.) Documentation for SAMT is available from the following sources:
- Readme.html documentation: detailed instructions on installing the system and running through a quick-start example.
- Detailed technical overview at the top of FastTranslateChart.cc, which complements the published work
- Doxygen comments on classes/functions + detailed notes in code
- The samt-technical mailing list (see below), for all the points we forgot to explain fully
We will be regularly updating the SAMT system. We have created the following Google groups to manage announcements and host technical discussions regarding the system.
- samt-announce to receive information on major updates.
- samt-technical to participate in technical discussion regarding the SAMT system. Get your compiling / running / theory questions answered here.
Of course, you can also email us directly: {zollmann or ashishv} (at) cs.cmu.edu
InterACT homepage
Andreas's homepage
Ashish's homepage