The Gacha filter removes sentence pairs whose global character mean falls below a given threshold. Use this cleaner to produce a small number of high-quality sentence pairs. It is an aggressive cleaner: with the threshold set at 20%, it removed ~64% of HindEnCorp for WMT14 while achieving the lowest TER (Tan and Pal, 2014; see http://www.aclweb.org/anthology/W/W14/W14-3323.pdf).
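The filtering idea can be illustrated with a minimal sketch. The description above does not define "global character mean" precisely, so this sketch assumes one plausible reading: compute the corpus-wide ratio of source characters to target characters, then keep only the pairs whose own ratio lies within the threshold of that global ratio. The names `global_char_ratio` and `gacha_filter_sketch` are hypothetical and are not the tool's actual API.

```python
def global_char_ratio(pairs):
    """Corpus-wide ratio of source characters to target characters."""
    src_total = sum(len(src) for src, _ in pairs)
    trg_total = sum(len(trg) for _, trg in pairs)
    return src_total / trg_total


def gacha_filter_sketch(pairs, threshold=0.2):
    """Keep pairs whose length ratio is within `threshold` of the global ratio.

    Assumption: this is one reading of the filter, not the original code.
    """
    # Pairs with an empty side cannot be scored, so drop them first.
    pairs = [(src, trg) for src, trg in pairs if src and trg]
    if not pairs:
        return []
    mean = global_char_ratio(pairs)
    lower, upper = mean * (1 - threshold), mean * (1 + threshold)
    return [(src, trg) for src, trg in pairs
            if lower <= len(src) / len(trg) <= upper]


if __name__ == "__main__":
    corpus = [
        ("a" * 100, "b" * 100),  # balanced pair
        ("a" * 100, "b" * 95),   # slightly longer source
        ("a" * 2, "b" * 10),     # badly mismatched lengths
    ]
    for src, trg in gacha_filter_sketch(corpus, threshold=0.2):
        print(len(src), len(trg))
```

Under this reading, a smaller threshold narrows the accepted band around the global ratio and discards more pairs, which matches the description of the filter as aggressive.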
Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
Liling Tan, Marcos Zampieri, Nikola Ljubešić and Jörg Tiedemann. 2014. Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora: Building Resources for Machine Translation Research. Reykjavik, Iceland.
Marcos Zampieri, Liling Tan and Constance Wang. 2014. Translation Clouds: Producing Word Clouds from Non-aligned Parallel Corpora [abstract]. In Proceedings of the 6th International Conference on Corpus Linguistics (CILC6). Las Palmas de Gran Canaria, Spain.
Liling Tan, Anne Schumann, Jose M.M. Martinez and Francis Bond. 2014. Sensible: L2 Translation Assistance by Emulating the Manual Post-Editing Process. In Proceedings of the Eighth International Workshop on Semantic Evaluation (SemEval 2014). Dublin, Ireland.