High Order Conditional Random Field Based Part of Speech Tagger and Ambiguity Resolver for Malayalam - a Highly Agglutinative Language
Abstract
Part-of-speech (POS) tagging, also called grammatical tagging, assigns a lexical class marker to every word in a document. It is an essential preprocessing step in many NLP systems, and tagged corpora play a significant role in Machine Translation, Information Retrieval, and Data Mining. POS tagging in Malayalam is difficult because it is an agglutinative language and 80-85% of the words in Malayalam text documents are compound words. Decomposing these words into their constituents is necessary for finalizing their POS tags. A single word may sometimes admit more than one morphological analysis, and hence more than one POS tag. Correctly resolving this kind of ambiguity for each occurrence of the word is crucial in many NLP applications. Currently available tag sets for other languages give importance only to the morphological and syntactic properties of the language, whereas the tag set designed by us considers the semantic features of the language. To test this system, documents were selected from well-known Malayalam newspapers and magazines. Up to 2352 sentences were tested, including simple, complex, and compound sentences. A word-level tagging accuracy of 95% and a sentence-level accuracy of 91% were obtained.
Keywords: POS Tag set, finite state transducer, compound word splitter, Extended CRF, Malayalam compound word
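The abstract reports two figures, a word-level and a sentence-level accuracy. The exact scoring procedure is not given here, so the following is only a minimal sketch under the usual assumption: word-level accuracy is the fraction of tokens tagged correctly, and sentence-level accuracy is the fraction of sentences in which every token is tagged correctly. The function name `tagging_accuracy` and the toy tags are hypothetical and are not taken from the paper's tag set.

```python
from typing import List, Tuple


def tagging_accuracy(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float]:
    """Return (word_level, sentence_level) accuracy for parallel tag sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")

    total_words = 0
    correct_words = 0
    correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence must have the same number of tags")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_words += len(gold_tags)
        correct_words += matches
        # A sentence counts as correct only if all of its tags match.
        correct_sentences += int(matches == len(gold_tags))

    if total_words == 0:
        raise ValueError("no tokens to score")
    return correct_words / total_words, correct_sentences / len(gold)


# Toy data with hypothetical tags (assumption: not the paper's tag set).
gold = [["N", "V"], ["N", "ADJ", "N", "V"]]
pred = [["N", "V"], ["N", "N", "N", "V"]]
word_acc, sent_acc = tagging_accuracy(gold, pred)
print(f"word-level: {word_acc:.3f}, sentence-level: {sent_acc:.3f}")
```

Under this definition a single wrong tag makes the whole sentence count as incorrect, which is why a sentence-level figure is typically lower than the word-level figure measured on the same data.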
Article Details
COPYRIGHT
Submission of a manuscript implies that the work described has not been published before; that it is not under consideration for publication elsewhere; and that, if and when the manuscript is accepted for publication, the authors agree to automatic transfer of the copyright to the publisher.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
- The journal allows the author(s) to retain publishing rights without restrictions.
- The journal allows the author(s) to hold the copyright without restrictions.