Gale-Church Filter: Cleaning noisy parallel data for machine translation

The Gacha (Gale-Church) filter removes sentence pairs that have a global character mean lower than a certain threshold. Use this cleaner to produce a small quantity of high-quality sentence pairs. It is an aggressive cleaner: with the threshold set at 20%, it removed ~64% of HindEnCorp during WMT14 and achieved the lowest TER (Tan and Pal, 2014; see http://www.aclweb.org/anthology/W/W14/W14-3323.pdf).

 

URL: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/other/gach...

authors: