Software outcome: Invitation-Based Data Selection for SMT

Based on the work of Hoang Cuong (EXPERT ESR) and Khalil Sima’an as reported in [1], a data selection tool for domain adaptation in SMT was implemented by Amir Kamran (SLPL Lab., ILLC, University of Amsterdam). The tool is available at the following GitHub repository:

Description of the tool

The invitation-based data selection approach exploits a sample of in-domain data (both monolingual and bilingual) as a prior to guide word alignment and phrase pair estimation in the large mixed-domain corpus. As a by-product, it yields accurate estimates of whether each sentence pair in the mixed-domain data is in-domain or out-of-domain. These estimates can be used to rank the mixed-domain sentence pairs by their relevance to the in-domain task, or for other purposes.
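The ranking step described above can be illustrated with a minimal sketch. It assumes that each sentence pair already has an in-domain score (for example, an estimated probability of being in-domain); the function name `rank_by_in_domain_score`, the toy sentence pairs, and the score values are hypothetical and do not reflect the tool's actual interface or output format.

```python
from typing import List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def rank_by_in_domain_score(
    pairs: List[SentencePair],
    scores: List[float],
    top_k: int,
) -> List[SentencePair]:
    """Return the top_k sentence pairs with the highest in-domain scores.

    `scores[i]` is assumed to be the in-domain estimate for `pairs[i]`;
    how such estimates are produced is outside the scope of this sketch.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    # Sort indices by descending score; sorted() is stable, so ties keep corpus order.
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in order[:top_k]]


# Toy mixed-domain corpus with made-up scores (illustration only).
pairs = [
    ("the patient was given aspirin", "de patient kreeg aspirine"),
    ("the match ended in a draw", "de wedstrijd eindigde in een gelijkspel"),
    ("take one tablet daily", "neem dagelijks een tablet"),
]
scores = [0.91, 0.12, 0.67]

selected = rank_by_in_domain_score(pairs, scores, top_k=2)
for src, tgt in selected:
    print(src, "|||", tgt)
```

Selecting a fixed number of top-ranked pairs is only one option; a score threshold would serve equally well, depending on how much training data the downstream SMT system needs.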

The re-implementation was conducted at ILLC (Institute for Logic, Language and Computation, University of Amsterdam), in part within the project "Data-Powered Domain-Specific Translation Services On Demand", supported by the grant "STW Open Technologieprogramma".

[1] Hoang, Cuong and Sima'an, Khalil (2014): Latent Domain Translation Models in Mix-of-Domains Haystack. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics.