Sonal Khosla, Haridasa Acharya


This paper surveys existing methods of building a parallel corpus. It starts with a short introduction to parallel corpora, followed by their applications. Parallel corpora built for different language pairs, and the methods adopted to build them, are then discussed. The paper covers the methodologies behind some of the major parallel corpora. The survey is restricted to corpora aligned at the sentence and document levels.


Keywords: Sentence Alignment; Web Mining; Parallel Corpus; Manual; Corpus




DOI: https://doi.org/10.26483/ijarcs.v9i4.6171



Copyright (c) 2018 International Journal of Advanced Research in Computer Science