A RULE-BASED STEMMER FOR PUNJABI ADJECTIVES

This work presents a rule-based stemmer for adjectives in the Punjabi language. Stemming is a method of deriving the root word from an inflected word. The proposed Punjabi Adjective Stemmer (PAS) uses a rule-based approach to convert inflected Punjabi adjectives to their root forms. A database of 1,762 valid Punjabi root adjectives has been created. When an adjective is given to PAS as input, PAS first compares it with the root database to determine whether it is a root adjective or an inflected one. If the input is a root adjective, no stemming is required and the input is returned as the output. Otherwise, the inflected adjective is passed to the suffix-stripping algorithm, which uses a set of predefined rules to obtain the corresponding root adjective. India is a linguistically rich country with 22 officially recognized languages, but the computational resources developed for these languages are scarce. Most stemmers developed for Punjabi so far have concentrated on nouns and proper names. To our knowledge, PAS is the only stemmer developed so far that specifically addresses the stemming of Punjabi adjectives. PAS has an overall accuracy of 88.76%.


I. INTRODUCTION
Stemming is the technique of reducing inflected words (e.g., admitted, admitting) to their base form (e.g., admit) by removing affixes. The "root" in this case may not be a true root word, but simply a canonical form of the original word. Stemming uses a heuristic approach that cuts off the end of a word in order to reduce it to its root form. Therefore, the words "admitted" and "admitting" would both be converted to "admit", because their endings are simply chopped off. Several algorithms can be used in the stemming process, such as the Porter stemming algorithm and dictionary-based, rule-based, hybrid, stochastic, corpus-based, n-gram, and matching algorithms. These techniques differ in their performance, their accuracy, and how they handle stemming errors [1].

II. LITERATURE REVIEW
A. Sharma et al. [2] proposed a stemming algorithm for Hindi information retrieval that uses a hybrid approach (a combination of brute force, suffix stripping, and suffix substitution). The algorithm was implemented for Hindi nouns and gives an accuracy of 92.2%. D. Kumar et al. [3] stemmed Punjabi words using brute force and suffix-stripping techniques and obtained 81.27% accuracy with a database of 52,000 words. N. Saharia et al. [4] adopted a suffix removal approach together with rules that generate all the possible suffixes, and obtained 82% accuracy with a root-word list of 20,000 words. D. Bijal et al. [5] presented an overview of stemming algorithms for Indian and non-Indian languages and concluded that stemming considerably improves retrieval results for both rule-based and statistical approaches. Jivani et al. [6] compared different stemming methods on the basis of usage, advantages, and disadvantages, and categorized them into three groups: truncating, statistical, and mixed. Pandey et al. [7] proposed a rule-based stemmer for Marathi that uses Marathi WordNet together with a stem exception dataset and named entities. They used Marathi WordNet to reduce over-stemming and under-stemming, and developed a hybrid system that combines rule-based and machine learning approaches to improve accuracy. I. Slawik et al. [8] proposed integration techniques for selective stemming in statistical machine translation (SMT) systems, applied to adjectives. Their technique reduces data sparsity when translating morphologically rich languages into less inflected languages by stemming only certain word types. V. Gupta [9] proposed automatic stemming of words for Punjabi, with eighteen suffixes for Punjabi nouns and proper names and a number of different suffixes for Punjabi verbs, adjectives, and adverbs, using different stemming rules. R. Puri et al. [10] developed a Punjabi WordNet database for a Punjabi stemmer. They used a suffix-stripping approach with a set of suffix removal rules, and their algorithm uses regular expressions to locate suffix matches. A. Ramanathan et al. [11] proposed a lightweight stemmer for Hindi that uses suffix removal to stem nouns, verbs, and adjectives; the system helps to overcome over-stemming and under-stemming. H. Singh [12] presented a study of Punjabi stemmers with a focus on the brute force technique. The study concludes that the approaches used by different researchers are nearly the same and that new directions are needed in this area. D. Kumar et al. [13] described the design and development of a Punjabi stemmer that uses brute force and suffix-stripping techniques and obtained 80.73% accuracy. C. Dhawan et al. [14] proposed a Punjabi stemmer based on a hybrid approach that combines brute force, suffix substitution, and suffix removal to overcome over-stemming and under-stemming. V. Gupta et al. [15] proposed a rule-based stemmer for Urdu; the stemmer handles complex, morphologically rich words and achieves 86.5% accuracy. S. Paul et al. [16] designed a rule-based Hindi lemmatizer with the aim of optimizing time and obtaining accurate results; they reported 89.08% accuracy. V. Gupta et al. [17] surveyed stemming techniques and existing stemmers for Indian languages. Dasgupta et al. [18] proposed an unsupervised morphological analyzer for Bengali. The algorithm segments words into prefixes, stems, and suffixes with no prior knowledge of the morphological rules of the language. Evaluated on a set of 4,110 Bengali words, it achieves an F-score of 83%, outperforming Linguistica, one of the most widely used unsupervised morphological analyzers, by about 23%. Goldsmith [19] developed an approach to unsupervised learning of the morphology of languages based on Minimum Description Length (MDL), which focuses on representing the data as compactly as possible. The study uses MDL analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging from 5,000 to 500,000 words. The results correspond well to the analyses that a human morphologist would produce.
Singh et al. [20] presented a systematic review of text stemming techniques. They described existing text stemming techniques and classified them on the basis of key parameters such as classification scheme, size and nature of the corpus, and target task. They concluded that unsupervised stemming offers future directions for researchers to improve the performance of unsupervised corpus-based stemming methods.

A. Categorization of adjectives in Punjabi language
An adjective describes some feature of a noun or pronoun. In the Punjabi language, an adjective usually comes before a noun but follows a pronoun. Punjabi adjectives are divided into two categories:
Inflected adjectives: These adjectives behave like nouns and change their form for gender, number, and case, e.g. ਸੋਹਣਾ sohna "handsome", ਕਾਲਾ kala "black".
Uninflected adjectives: These adjectives do not change form for gender, number, or case, e.g. ਮਿਹਨਤੀ mehnatī "hardworking", ਮਸ਼ਹੂਰ mashhūr "famous".
In this paper, a rule-based approach is used for stemming Punjabi adjectives. The approach combines brute force and suffix-stripping techniques to make the system more efficient and accurate.

B. Approach used
In this paper, a rule-based approach is used for stemming Punjabi adjectives. We created a database of 1,762 root adjectives. To build the database, we went through Punjabi dictionaries and various other resources and tried to cover commonly used words. The resources we consulted are listed below:
http://dic.learnpunjabi.org/default.aspx
http://punjabi.aglsoft.com/punjabi/learngrammar/?show=adjective
https://www.enchantedlearning.com
Punjabi Vocabulary of Common Terms by Punjabi University
Swan pocket English to Punjabi dictionary
English to Punjabi Shabdkosh
A Punjabi adjective is entered as input to PAS. PAS then searches for a matching word in the root database to determine whether the input word is inflected. If the input word is found in the root database, it is not inflected, and PAS returns the same word as output. Otherwise, the input word is passed to the suffix-stripping algorithm, which removes the suffix from the end of the word using the suffix-stripping rules defined in the system. The figure below represents the schematic diagram of PAS.

C. List of Suffixes for Punjabi Adjectives
We grouped the adjective suffixes into lists on the basis of suffix length. These lists help in developing the rules for PAS. After a thorough analysis of the adjectives that exist in the Punjabi language, we found that adjective suffixes range from a minimum of 1 character to a maximum of 5 characters in length. The 5 suffix lists, organized by suffix length, are as follows.
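As a rough illustration of this grouping, the sketch below buckets a handful of suffixes by length. It assumes that length is counted in Unicode code points and uses a few suffixes in standard Unicode Gurmukhi chosen for illustration; the function name group_by_length and the sample list are hypothetical and are not the complete suffix inventory of PAS.

```python
from collections import defaultdict


def group_by_length(suffixes):
    """Bucket suffixes by length (in Unicode code points), longest bucket first."""
    buckets = defaultdict(list)
    for suffix in suffixes:
        buckets[len(suffix)].append(suffix)
    # Return a plain dict whose keys run from the longest length to the shortest.
    return {length: buckets[length] for length in sorted(buckets, reverse=True)}


# Illustrative sample only (assumption): not the full PAS suffix lists.
SAMPLE_SUFFIXES = ["ਤਾ", "ਦਾਇਕ", "ਈ", "ਵਾਨ", "ਣ", "ਘਾਤ"]

suffix_lists = group_by_length(SAMPLE_SUFFIXES)
for length, items in suffix_lists.items():
    print(length, items)
```

Keeping the buckets ordered from longest to shortest makes it straightforward to try long suffixes before short ones later on.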

D. Proposed Rules for PAS
After generating the suffix lists, rules for stemming adjectives were developed on the basis of suffix length. A total of 39 rules were created to remove suffixes from the input word.
Rule 1: If a Punjabi adjective word ends with "ੑ ਯਤੀ ", strip the suffix "ੑ ਯਤੀ " from the end of the input word.
Rule 2: If a Punjabi adjective word ends with "ਸ਼ੀਰਤਾ", strip the suffix "ਸ਼ੀਰਤਾ" from the end of the input word.
Rule 3: If a Punjabi adjective word ends with "ੂ ਯਵਕ", strip the suffix "ੂ ਯਵਕ" from the end of the input word.
Rule 4: If a Punjabi adjective word ends with "ਾਤਯ", strip the suffix "ਾਤਯ" from the end of the input word.
Rule 5: If a Punjabi adjective word ends with "ੂ ਯਣ", strip the suffix "ੂ ਯਣ" from the end of the input word.
Rule 6: If a Punjabi adjective word ends with "ੁ ਣਾ", strip the suffix "ੁ ਣਾ" from the end of the input word.
Rule 7: If a Punjabi adjective word ends with "ਸ਼ਕਤੀ", strip the suffix "ਸ਼ਕਤੀ" from the end of the input word.
Rule 8: If a Punjabi adjective word ends with "ਦਾਇਕ", strip the suffix "ਦਾਇਕ" from the end of the input word.
Rule 9: If a Punjabi adjective word ends with "ਫਾਜ਼ੀ", strip the suffix "ਫਾਜ਼ੀ" from the end of the input word.
Rule 10: If a Punjabi adjective word ends with "ਭੰ ਦੀ", strip the suffix "ਭੰ ਦੀ" from the end of the input word.
Rule 11: If a Punjabi adjective word ends with "ਫਾਜ਼", strip the suffix "ਫਾਜ਼" from the end of the input word.
Rule 12: If a Punjabi adjective word ends with "ਘਾਤ", strip the suffix "ਘਾਤ" from the end of the input word.
Rule 13: If a Punjabi adjective word ends with "ਮ ਗ", strip the suffix "ਮ ਗ" from the end of the input word.
Rule 14: If a Punjabi adjective word ends with "ਸ਼ੀਰ", strip the suffix "ਸ਼ੀਰ" from the end of the input word.
Rule 15: If a Punjabi adjective word ends with "ਵਾਨ", strip the suffix "ਵਾਨ" from the end of the input word.
Rule 16: If a Punjabi adjective word ends with "ਕਯਣ", strip the suffix "ਕਯਣ" from the end of the input word.
Rule 17: If a Punjabi adjective word ends with "ਉਣਾ", strip the suffix "ਉਣਾ" from the end of the input word.
Rule 18: If a Punjabi adjective word ends with "ਦਾਯ", strip the suffix "ਦਾਯ" from the end of the input word.
Rule 19: If a Punjabi adjective word ends with "ਸੀਣ", strip the suffix "ਸੀਣ" from the end of the input word.
Rule 20: If a Punjabi adjective word ends with " ਸ਼", strip the suffix " ਸ਼" from the end of the input word.
Rule 21: If a Punjabi adjective word ends with "ਭੰ ਦ", strip the suffix "ਭੰ ਦ" from the end of the input word.
Rule 22: If a Punjabi adjective word ends with "ਖ ਯ", strip the suffix "ਖ ਯ" from the end of the input word.
Rule 23: If a Punjabi adjective word ends with "ਕਾਯ", strip the suffix "ਕਾਯ" from the end of the input word.
Rule 24: If a Punjabi adjective word ends with "ਈਆਾਂ ", strip the suffix "ਈਆਾਂ " from the end of the input word.
Rule 25: If a Punjabi adjective word ends with "ਮ ਆਾਂ ", strip the suffix "ਮ ਆਾਂ " from the end of the input word.
Rule 26: If a Punjabi adjective word ends with " ੀਆਾਂ ", strip the suffix " ੀਆਾਂ " from the end of the input word.
Rule 27: If a Punjabi adjective word ends with "ਫੁੱ ਧ", strip the suffix "ਫੁੱ ਧ" from the end of the input word.
Rule 28: If a Punjabi adjective word ends with " ੀ", strip the suffix " ੀ" from the end of the input word.
Rule 29: If a Punjabi adjective word ends with "ਮ ਓ", strip the suffix "ਮ ਓ" from the end of the input word.
Rule 30: If a Punjabi adjective word ends with " ੀਓ", strip the suffix " ੀਓ" from the end of the input word.
Rule 31: If a Punjabi adjective word ends with "ਣ", strip the suffix "ਣ" from the end of the input word.
Rule 32: If a Punjabi adjective word ends with " ਾ ਾਂ ", strip the suffix " ਾ ਾਂ " from the end of the input word.
Rule 33: If a Punjabi adjective word ends with "ਈ", strip the suffix "ਈ" from the end of the input word.
Rule 34: If a Punjabi adjective word ends with "ਤਾ", strip the suffix "ਤਾ" from the end of the input word.
Rule 35: If a Punjabi adjective word ends with " ੀ", strip the suffix " ੀ" from the end of the input word.
Rule 36: If a Punjabi adjective word ends with " ਾ", strip the suffix " ਾ" from the end of the input word.
Rule 37: If a Punjabi adjective word ends with " ", strip the suffix " " from the end of the input word.
Rule 38: If a Punjabi adjective word ends with " ੂ ", strip the suffix " ੂ " from the end of the input word.
Rule 39: If a Punjabi adjective word ends with " ", strip the suffix " " from the end of the input word.
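A minimal sketch of how rules of this form might be applied is given below. It assumes the rules are stored as an ordered list of suffix strings and that the first matching suffix wins; the function name apply_first_rule and the three-suffix sample are hypothetical, and the suffixes are written in standard Unicode Gurmukhi for illustration. The order matters here, because every word that ends in "ਤਾ" also ends in "ਾ". The guard against stripping an entire word is an added safety check, not something the rules state.

```python
def apply_first_rule(word, suffixes):
    """Strip the first suffix in `suffixes` that `word` ends with.

    Returns (stem, matched_suffix); matched_suffix is None if no rule fires.
    """
    for suffix in suffixes:
        # Require a non-empty remainder so a word is never stripped to nothing.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, None


# Illustrative, ordered sample (assumption): longer suffixes come first.
SAMPLE_RULES = ["ਦਾਇਕ", "ਤਾ", "ਾ"]

word = "ਸੁੰਦਰ" + "ਤਾ"  # sundarta, built here from sundar plus a suffix
stem, matched = apply_first_rule(word, SAMPLE_RULES)
print(stem, matched)
```

If the single-character suffix were listed first, the same word would lose only its final vowel sign and the result would not be a valid stem.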

E. Proposed Algorithm
An inflected Punjabi adjective is entered as input to PAS (step 1). First, PAS compares the input word with the words in the root database (step 2). If a match is found, the input word is itself a root word and does not need stemming, so it is returned as the output. If the input word does not match any word in the root database, PAS goes through the suffix-stripping phase to stem the input word to its root (steps 3 to 7). Here, PAS checks its 5 suffix lists one by one, starting with suffix list-1, which contains the longest suffixes (5 characters), then suffix list-2 (4 characters), and so on until it finds a matching suffix. If the ending of the input word matches any suffix in these lists, PAS removes that suffix from the end of the word.
There are some special cases related to suffix list-3 (step 5), suffix list-4 (step 6), and suffix list-5 (step 7). For example, at step 5, if the input word is " ਸਮਣਆਾਂ ", its suffix matches "ਮ ਆਾਂ ". Removing the suffix gives " ਸਣ", which is not a valid adjective, so we add " ਾ" to the end to obtain the valid adjective " ਸਣਾ". Similarly, at step 6, if the input word is " ਸਣੀਓ", its suffix matches " ੀਓ". Removing the suffix gives " ਸਣ", which is not a valid adjective, so we add " ਾ" to the end to obtain the valid adjective " ਸਣਾ". In the same way, at step 7, if the input word is " ਸਣ ", its suffix matches " ". Removing the suffix gives " ਸਣ", which is again not a valid adjective, so we add " ਾ" to the end to obtain the valid adjective " ਸਣਾ". At step 8, if the suffix of the input word does not match any suffix in the rule base, the input word is taken to be a root word. It is therefore added to the root database and returned as output.
Step 1. The inflected Punjabi adjective is given as input to PAS.
Step 2. If the input word matches a root word in the root database, PAS returns the input word as output and goes to step 9. Otherwise, go to step 3.
Step 3. PAS checks suffix list-1 (length 5). If the ending of the input word matches any suffix in suffix list-1, PAS removes that suffix from the end and searches for the stemmed word in the root database. If the word is found, it is returned as output and PAS goes to step 9. Otherwise, go to step 4.
Step 4. PAS checks suffix list-2 (length 4). If the ending of the input word matches any suffix in suffix list-2, PAS removes that suffix and searches for the stemmed word in the root database. If the word is found, it is returned as output and PAS goes to step 9. Otherwise, go to step 5.
Step 5. PAS checks suffix list-3 (length 3). If the ending of the input word matches any suffix in suffix list-3, PAS removes that suffix. If the matched suffix is "ਮ ਆਾਂ ", then after removing the suffix, PAS adds " ਾ" at the end. PAS then searches for the stemmed word in the root database. If the word is found, it is returned as output and PAS goes to step 9. Otherwise, go to step 6.
Step 6. PAS checks suffix list-4 (length 2). If the ending of the input word matches any suffix in suffix list-4, PAS removes that suffix. If the matched suffix is " ੀਓ", "ਮ ਓ", or " ੀ", then after removing the suffix, PAS adds " ਾ" at the end. PAS then searches for the stemmed word in the root database. If the word is found, it is returned as output and PAS goes to step 9. Otherwise, go to step 7.
Step 7. PAS checks suffix list-5 (length 1). If the ending of the input word matches any suffix in suffix list-5, PAS removes that suffix. If the matched suffix is " " or " ੂ ", then after removing the suffix, PAS adds " ਾ" at the end. PAS then searches for the stemmed word in the root database. If the word is found, it is returned as output and PAS goes to step 9. Otherwise, go to step 8.
Step 8. If the input word does not match any word in the root database and its suffix does not match any suffix in the rule base, the input word is taken to be a root word. It is added to the root database and returned as the output, and PAS goes to step 9.
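The steps above can be sketched roughly as follows. This is an illustrative outline under several assumptions, not the PAS implementation: the suffix lists are a small sample written in standard Unicode Gurmukhi, the flags marking which suffixes get " ਾ" appended afterwards are guesses (the length-1 entry "ੇ" in particular is assumed for this sketch), and the sketch keeps trying later suffixes and shorter lists when a candidate stem is not in the root database. The names SUFFIX_LISTS, KANNA, and stem are hypothetical.

```python
KANNA = "\u0a3e"  # Gurmukhi vowel sign aa, appended in the special cases

# Illustrative lists (assumption), longest suffixes first.
# Each entry is (suffix, append_kanna_after_stripping).
SUFFIX_LISTS = [
    [("ਦਾਇਕ", False)],
    [("ਿਆਂ", True), ("ਵਾਨ", False)],
    [("ੀਓ", True), ("ਤਾ", False)],
    [("ੇ", True), ("ਾ", False)],
]


def stem(word, roots):
    """Return the root of `word`, using `roots` (a set) as the root database."""
    # Step 2: a word already in the root database needs no stemming.
    if word in roots:
        return word
    # Steps 3 to 7: try the suffix lists from longest to shortest.
    for suffixes in SUFFIX_LISTS:
        for suffix, append_kanna in suffixes:
            if not word.endswith(suffix):
                continue
            candidate = word[: -len(suffix)]
            if append_kanna:
                candidate += KANNA
            if candidate in roots:
                return candidate
    # Step 8: no rule produced a known root, so treat the word as a new root.
    roots.add(word)
    return word


roots = {"ਸੋਹਣਾ", "ਸੁੰਦਰ"}
for w in ["ਸੋਹਣਿਆਂ", "ਸੋਹਣੀਓ", "ਸੁੰਦਰਤਾ", "ਸੋਹਣਾ"]:
    print(w, "->", stem(w, roots))
```

Storing the flag next to each suffix keeps the special cases in one place. Note that step 8 modifies the root database, so a misspelled input would be stored as a root; a production system would probably want to review such additions.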

IV. RESULTS AND DISCUSSION
The accuracy of a stemmer depends on the size of the root database and the completeness of the rules created to remove suffixes. A larger database results in higher accuracy [22]. We entered 1,762 root words in our database. To analyze the performance of PAS, we entered 1,762 words as input, of which 1,564 were correctly stemmed. Therefore, the accuracy of our stemmer is 88.76%.
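The reported figure follows directly from the two counts, taking accuracy as the number of correctly stemmed words divided by the number of input words. The short check below uses a hypothetical helper named accuracy.

```python
def accuracy(correct, total):
    """Percentage of correctly stemmed words, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 2)


print(accuracy(1564, 1762))  # 88.76
```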

V. CONCLUSION
A stemming system generates linguistically normalized text, which improves the results of many information retrieval applications [3]. Our stemmer works for Punjabi adjectives and is a simplified version of a stemmer. It uses a rule-based suffix-stripping algorithm to stem Punjabi adjectives. This stemmer can serve as a basic preprocessing tool for the Punjabi language and will be helpful in various NLP applications and in text mining. VI.