TsinghuaAligner: A Statistical Bilingual Word Alignment System


Note: The size of the manually aligned Chinese-English parallel corpus has been increased from 900 to 40,715 sentence pairs.

Introduction

Word alignment is a natural language processing task that identifies which words in a sentence correspond to which words in its translation. TsinghuaAligner is a statistical bilingual word alignment system developed by the Natural Language Processing Group at Tsinghua University. It takes as input a set of sentence pairs that are translations of each other and automatically produces word alignments for them. TsinghuaAligner has the following features:

Online Demo

An online demo is available for trying out the system. The demo currently supports only Chinese and English.

System Requirements

TsinghuaAligner supports Linux i686 and Mac OS X. You need to install the following third-party software to build TsinghuaAligner:

User Manual

The user manual describes how to install and use TsinghuaAligner and explains its technical details.

Downloads

The source code and datasets are free to download.

Link | Size | Description | Date
TsinghuaAligner.tar.gz | 715 KB | The package contains the source code of the system and example datasets | 2015/04/22
model.ce.tar.gz | 57 MB | A Chinese-English model that can be used by TsinghuaAligner | 2014/10/07
Chinese-English training set | 43 MB | Training set (700K sentence pairs from the United Nations and Hong Kong government websites) | 2014/10/07
Chinese-English evaluation set | 4.7 MB | Development and test sets (40,715 sentence pairs with manual annotation) | 2018/10/13

Here is a list of institutions that downloaded TsinghuaAligner for research use (as of 2018/10/13).

History

References

Yang Liu and Maosong Sun. 2015. Contrastive Unsupervised Word Alignment with Non-Local Features. In Proceedings of AAAI 2015, Austin, Texas, January. [paper][arXiv][slides]

Contact

Yang Liu

Acknowledgements