ESR5

Gale-Church Filter: Cleaning noisy parallel data for machine translation

The Gacha filter removes sentence pairs whose global character mean falls below a given threshold. Use this cleaner to produce a small quantity of high-quality sentence pairs. It is aggressive: with the threshold set at 20%, it removed ~64% of HindEnCorp during WMT14 while achieving the lowest TER (Tan and Pal, 2014; see http://www.aclweb.org/anthology/W/W14/W14-3323.pdf).
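To make the idea concrete, here is a minimal sketch of a character-length filter in Python. It is not the authors' implementation. It assumes that the "global character mean" is the ratio of total source characters to total target characters across the corpus, and that the threshold defines a relative band around that ratio. The function names `global_char_ratio` and `length_ratio_filter` are hypothetical and chosen only for illustration.

```python
def global_char_ratio(pairs):
    """Corpus-wide ratio of source characters to target characters.

    Assumption: this is what "global character mean" refers to.
    """
    src_chars = sum(len(src) for src, _ in pairs)
    trg_chars = sum(len(trg) for _, trg in pairs)
    return src_chars / trg_chars


def length_ratio_filter(pairs, threshold=0.2):
    """Keep pairs whose character ratio lies within `threshold`
    (relative) of the corpus-wide ratio.

    Assumption: a 20% threshold means a band of +/-20% around the mean.
    """
    # Pairs with an empty side cannot be scored, so drop them first.
    pairs = [(src, trg) for src, trg in pairs if src and trg]
    if not pairs:
        return []
    mean = global_char_ratio(pairs)
    lower, upper = mean * (1 - threshold), mean * (1 + threshold)
    return [(src, trg) for src, trg in pairs
            if lower <= len(src) / len(trg) <= upper]
```

Calling `length_ratio_filter(pairs, threshold=0.2)` on a list of `(source, target)` string tuples returns only the pairs whose length ratio stays close to the corpus-wide ratio. Under this interpretation a tighter threshold discards more pairs, which is consistent with the aggressive behaviour described above. Because the mean is computed over the whole corpus, a few extreme outliers can shift the band.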

 

Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection

Liling Tan, Marcos Zampieri, Nikola Ljubešić, Jörg Tiedemann. 2014. Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora: Building Resources for Machine Translation Research. Reykjavik, Iceland.

Explicit Holmes: A Diachronic Investigation of Explicitness and Explicitation in Chinese Translations of Detective Stories

Constance Wang, Liling Tan. 2014. Explicit Holmes: A Diachronic Investigation of Explicitness and Explicitation in Chinese Translations of Detective Stories. In Kerstin Kunz, Elke Teich, Silvia Hansen-Schirra, Stella Neumann and Peggy Daut (Editors). Caught in the Middle – Language Use and Translation: A Festschrift for Erich Steiner on the Occasion of his 60th Birthday. Germany: Saarland University Press.
