LEPOR

LEPOR is an automatic, language-independent machine translation evaluation metric with tunable parameters and reinforced factors.

Background

Since IBM proposed BLEU[1] as an automatic metric for machine translation (MT) evaluation, many other methods have been proposed to revise or improve on it, such as TER and METEOR.[2] However, the traditional automatic evaluation metrics have several problems. Some metrics perform well on certain languages but poorly on others, which is usually called the language bias problem. Some metrics rely on a large number of language features or a great deal of linguistic information, which makes it difficult for other researchers to reproduce the experiments. LEPOR is an automatic evaluation metric that tries to address some of these problems.[3] It is designed with augmented factors and corresponding tunable parameters to address the language bias problem. An improved version, hLEPOR,[4] uses optimized linguistic features extracted from treebanks. Another advanced version, nLEPOR,[5] adds n-gram features to the previous factors. The LEPOR metric has since been developed into the LEPOR series.[6]

Design

LEPOR combines four factors: an enhanced length penalty, precision, an n-gram word order penalty, and recall. The enhanced length penalty penalizes the hypothesis translation (usually the output of a machine translation system) if it is longer or shorter than the reference translation. The precision score reflects the accuracy of the hypothesis translation. The recall score reflects the fidelity of the hypothesis translation to the reference translation or source language. The n-gram based word order penalty accounts for differences in word position between the hypothesis translation and the reference translation. Word order penalty factors have been shown to be useful by several researchers, for example Wong and Kit (2008).[7]
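The way these four factors interact can be illustrated with a minimal Python sketch. It is not the reference implementation: it assumes the product form described in Han et al. (2012), that is, a length penalty times a position-difference penalty times a weighted harmonic mean of recall and precision, and it replaces the n-gram context alignment with a much simpler nearest-position alignment. The function names and the default weights alpha = beta = 1 are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Sequence


def length_penalty(hyp_len: int, ref_len: int) -> float:
    """Penalize a hypothesis that is shorter or longer than the reference."""
    if hyp_len == 0 or ref_len == 0:
        return 0.0
    if hyp_len == ref_len:
        return 1.0
    if hyp_len < ref_len:
        return math.exp(1 - ref_len / hyp_len)
    return math.exp(1 - hyp_len / ref_len)


def position_penalty(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """Simplified word order penalty: exp(-mean normalized position difference).

    Assumption: each hypothesis word is aligned to the identical reference
    word whose normalized position is closest; unmatched words add nothing.
    The published metric uses n-gram context to choose the alignment.
    """
    if not hyp or not ref:
        return 0.0
    total = 0.0
    for i, word in enumerate(hyp, start=1):
        candidates = [j for j, r in enumerate(ref, start=1) if r == word]
        if candidates:
            total += min(abs(i / len(hyp) - j / len(ref)) for j in candidates)
    return math.exp(-total / len(hyp))


def weighted_harmonic(precision: float, recall: float,
                      alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted harmonic mean; alpha weights recall, beta weights precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return (alpha + beta) / (alpha / recall + beta / precision)


def lepor_sketch(hyp: Sequence[str], ref: Sequence[str],
                 alpha: float = 1.0, beta: float = 1.0) -> float:
    """Sentence-level score in [0, 1] from the three combined factors."""
    if not hyp or not ref:
        return 0.0
    # Clipped unigram matches shared by hypothesis and reference.
    common = sum((Counter(hyp) & Counter(ref)).values())
    precision = common / len(hyp)
    recall = common / len(ref)
    return (length_penalty(len(hyp), len(ref))
            * position_penalty(hyp, ref)
            * weighted_harmonic(precision, recall, alpha, beta))
```

Under these assumptions, a hypothesis identical to the reference scores 1.0, a reordered hypothesis with the same words scores lower because only the position penalty changes, and a hypothesis with no words in common scores 0. Raising alpha relative to beta makes the score more sensitive to recall, which is the kind of tuning the metric exposes for different language pairs.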

Performance

The LEPOR series has performed well in the annual Workshop on Statistical Machine Translation (ACL-WMT), which is held by the Special Interest Group on Machine Translation (SIGMT) of the Association for Computational Linguistics (ACL). ACL-WMT 2013[8] had two translation and evaluation tracks, English-to-other and other-to-English, where the "other" languages were Spanish, French, German, Czech and Russian. In the English-to-other direction, nLEPOR achieved the highest system-level correlation with human judgments as measured by the Pearson correlation coefficient, and the second highest as measured by the Spearman rank correlation coefficient. In the other-to-English direction, nLEPOR performed moderately and METEOR yielded the highest correlation with human judgments. This is attributed to the fact that, apart from the officially provided training data, nLEPOR uses only one concise linguistic feature, part-of-speech information, whereas METEOR uses many external resources, such as synonym dictionaries, paraphrases and stemming.
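System-level correlation of this kind compares one metric score per MT system with one human score per system. The following Python sketch shows how the Pearson and Spearman coefficients could be computed for such paired lists. It is an illustration only, not the WMT evaluation script, and the function names are assumptions. It also assumes both lists have the same length and are not constant, since a constant list would cause a division by zero.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Pearson rewards a metric whose scores are linearly related to the human scores, while Spearman only checks whether the metric ranks the systems in the same order, which is why a metric can place differently under the two measures.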

Notes

  1. Papineni et al. (2002)
  2. Banerjee and Lavie (2005)
  3. Han et al. (2012)
  4. Han et al. (2013a)
  5. Han et al. (2013b)
  6. Han et al. (2014)
  7. Wong and Kit (2008)
  8. ACL-WMT (2013)

References

  • Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. (2002). "BLEU: a method for automatic evaluation of machine translation" in ACL-2002: 40th Annual meeting of the Association for Computational Linguistics, pp. 311–318.
  • Han, A.L.F., Wong, D.F., and Chao, L.S. (2012). "LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors" in Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pp. 441–450. Mumbai, India.
  • Han, A.L.F., Wong, D.F., Chao, L.S., He, L., Lu, Y., Xing, J., and Zeng, X. (2013a). "Language-independent Model for Machine Translation Evaluation with Reinforced Factors" in Proceedings of the Machine Translation Summit XIV (MT SUMMIT 2013), pp. 215–222. Nice, France. Publisher: International Association for Machine Translation.
  • Han, A.L.F., Wong, D.F., Chao, L.S., Lu, Y., He, L., Wang, Y., and Zhou, J. (2013b). "A Description of Tunable Machine Translation Evaluation Systems in WMT13 Metrics Task" in Proceedings of the Eighth Workshop on Statistical Machine Translation, ACL-WMT13, pp. 414–421. Sofia, Bulgaria. Association for Computational Linguistics.
  • Han, A.L.F., Wong, D.F., Chao, L.S., He, L., and Lu, Y. (2014). "Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation" in The Scientific World Journal. Issue: Recent Advances in Information Technology. ISSN 1537-744X. Hindawi Publishing Corporation.
  • ACL-WMT. (2013). "ACL-WMT13 METRICS TASK".
  • Wong, B. T-M, and Kit, C. (2008). "Word choice and word position for automatic MT evaluation" in Workshop: MetricsMATR of the Association for Machine Translation in the Americas (AMTA), short paper, Waikiki, US.
  • Banerjee, S. and Lavie, A. (2005). "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments" in Proceedings of Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association of Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005.

This article is issued from Wikipedia - version of the 1/28/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.