Eduard Barbu, ER1
Location: Translated, Italy
Project Title: Investigation of automatic methods for collection & preparation of multilingual data
I joined Translated and the EXPERT project in October 2014 as an Experienced Researcher (ER). Since then I have worked on several projects in data cleaning and data collection.
As far as data cleaning is concerned, I have applied various techniques to correct bilingual corpora collected from the web (EuroParl, for example). All these corpora have been uploaded to the database of MyMemory, the biggest translation memory in the world.
Further, I have developed a language-independent system, based on machine learning algorithms, for identifying false translations in translation memories. The system is open source and will be made public soon. Keep an eye on this page for a link to the system.
With respect to data collection, I have been involved in several tasks. Because data collection is currently a hot topic, I decided to join forces with other researchers interested in it. In particular, I am working with Christian Buck from the University of Edinburgh and Achim Ruopp from TAUS. We have estimated the amount of multilingual data available in Common Crawl and thoroughly evaluated the available software for collecting parallel data from the web.
Further, we are collecting data from Common Crawl and from the web using various techniques. The collected data is aligned and exported in the TMX format. Pending legal issues, we will release the aligned data and/or a database containing the URLs of the parallel documents identified.
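As an illustration of the export step, here is a minimal sketch of writing aligned sentence pairs to TMX with Python's standard library. It is not the code used in the project: the function name `build_tmx`, the header values and the example language pair are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute used on each <tuv> element.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="it"):
    """Return a TMX 1.4 document (as a string) for aligned (source, target) pairs.

    Hypothetical helper for illustration; header values are placeholders.
    """
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "example-exporter",  # placeholder tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        # One translation unit per aligned pair, one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("Good morning.", "Buongiorno.")]))
```

Each aligned pair becomes one `tu` element holding one `tuv` per language, which is the basic shape a translation memory tool expects when importing TMX.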
ESR3, Hernani Costa, is doing a three-month secondment at Translated (October-December 2015). He is studying techniques for identifying parallel documents in data crawled from the web.
Research Interests: data collection, machine translation, terminology extraction
Eduard Barbu (2015). “Spotting false translation segments in translation memories”. In Proceedings of the 1st International Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16. September 2015, Hissar, Bulgaria. http://rgcl.wlv.ac.uk/events/NLP4TM/5_Paper.pdf
Eduard Barbu, Carla Parra Escartín, Luisa Bentivogli, Matteo Negri, Marco Turchi, Marcello Federico, Luca Mastrostefano, Constantin Orasan (2016). 1st Shared Task on Automatic Translation Memory Cleaning: Preparation and Lessons Learned. In Proceedings of the 2nd Workshop on Natural Language Processing for Translation Memories (NLP4TM 2016), pages 1-6, 28 May 2016, Portorož, Slovenia.
Eduard Barbu, Carla Parra Escartín, Luisa Bentivogli, Matteo Negri, Marco Turchi, Constantin Orasan, Marcello Federico (2016). The First Automatic Translation Memory Cleaning Shared Task. Machine Translation.
Constantin Orasan, Carla Parra Escartín, Eduard Barbu, Marcello Federico (2016). Proceedings of the 2nd Workshop on Natural Language Processing for Translation Memories (NLP4TM 2016) at LREC 2016, 28 May 2016, Portorož, Slovenia.
Masoud Jalili Sabet, Matteo Negri, Marco Turchi, Eduard Barbu (2016). An Unsupervised Method for Automatic Translation Memory Cleaning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 287–292, Berlin, Germany, August 7-12.