Cao Van Viet, Do Ngoc Quynh, Le Anh Cuong

Abstract

A language model assigns a probability to a sequence of words. It is useful for many Natural Language Processing (NLP) tasks such as machine translation, spelling correction, speech recognition, optical character recognition, parsing, and information retrieval. For Vietnamese, although several studies have used language models within NLP systems, there has been no independent study of language modeling itself, covering both experimental and theoretical aspects. In this paper we experimentally investigate various Language Models (LMs) for Vietnamese, which are based on different smoothing techniques: Laplace, Witten-Bell, Good-Turing, interpolated Kneser-Ney, and back-off Kneser-Ney. These models are evaluated experimentally on a large corpus of texts. To evaluate the language models within an application, we also build a statistical machine translation system that translates from English to Vietnamese. In the experiments we use about 255 MB of text for building the language models and more than 60,000 English-Vietnamese parallel sentence pairs for building the machine translation system.

Key words: Vietnamese Language Models; N-gram; Smoothing techniques in language models; Language models and statistical machine translation
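
As an illustration of the abstract's opening definition, the following is a minimal sketch of a bigram language model with Laplace (add-one) smoothing, the simplest of the techniques listed above. It is not the authors' implementation: the toy two-sentence corpus, the function names, and the `<s>` and `</s>` boundary markers are assumptions made for this example, and the input is assumed to be already word-segmented and separated by spaces.

```python
from collections import Counter
import math

def train_bigram_counts(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized sentences."""
    context_counts = Counter()   # how often each token appears as a left context
    bigram_counts = Counter()    # how often each (previous, word) pair appears
    vocab = set()                # tokens the model can predict (includes </s>, not <s>)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab

def laplace_prob(prev, word, context_counts, bigram_counts, vocab_size):
    """Add-one estimate: P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)

def sentence_logprob(sentence, context_counts, bigram_counts, vocab_size):
    """Log-probability of a sentence as the sum of its bigram log-probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log(laplace_prob(prev, word, context_counts, bigram_counts, vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )

# Toy corpus, invented for illustration only.
corpus = ["tôi học tiếng Việt", "tôi học tiếng Anh"]
context_counts, bigram_counts, vocab = train_bigram_counts(corpus)

seen = sentence_logprob("tôi học tiếng Việt", context_counts, bigram_counts, len(vocab))
unseen = sentence_logprob("Việt tiếng học tôi", context_counts, bigram_counts, len(vocab))
print(seen, unseen)
```

Because one count is added to every possible bigram, a word order never seen in training still receives a non-zero probability, only a lower one than an observed sentence. The other smoothing methods named in the abstract differ in how they redistribute probability mass to such unseen events.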
