PERFORMANCE OF FITNESS MEASURES TO RECOGNIZE TELUGU CHARACTERS

Character recognition is a long-standing application of pattern recognition that has attracted many researchers over the last few decades. Its applications include banking, post offices, defence organizations, reading aids for the blind, library automation, language processing and multimedia design. The current work considers handwritten Telugu characters, which are complex to recognize, and selects the relevant features using a genetic algorithm. The performance of the system is compared by running the algorithm with two different fitness measures.


I. INTRODUCTION
Handwritten character recognition (HCR) involves three basic steps, namely development of the database, feature extraction and classification of characters. One of the major aspects of HCR is the digitization of documents to develop the database. Several devices are available for digitizing documents, such as scanners, cameras and mobile cameras. India is a diverse nation and is rich in literature. As of today there are 33 languages and 2000 dialects, of which 22 are recognized under the constitution. The most popular South Indian languages are Telugu, Tamil, Kannada, Malayalam, Tulu, etc. [1,2]. The alphabets of these languages have a large number of basic and compound characters. The primary challenge for any character recognition system is to design a framework that handles the text layout in the image, character fonts and sizes, and variability in imaging conditions such as uneven lighting, reflection and shadowing. Even though these problems can be addressed with sophisticated equipment, many more issues remain. The following are some of the issues in handwritten character recognition:
i. Several distortions are introduced while scanning the documents [3].
ii. For old and historical (palm-leaf) documents, which are poor in quality, these distortions are inevitable.
iii. Recognition of handwritten characters is a tedious task as the style of writing varies from individual to individual.
iv. Variation in font style, size, orientation, alignment and complex background makes the character recognition phase a challenging task.
There are 18 vowels and 36 consonants in the Telugu language, shown in Fig. 1(a) and (b) respectively. A few groups of similar characters are shown in Fig. 1(c); they indicate how difficult it is to distinguish characters within such groups. Moreover, there are no standard databases available for Indian languages, and hence developing a handwritten character database is difficult [4]. This paper describes the process of recognizing the basic characters written on paper documents. Section II describes existing methods for recognizing handwritten characters. The procedure used to recognize the handwritten characters is explained in section III. The experimental results are discussed in section IV. The paper is concluded in section V.

II. EXISTING METHODS
Static topologies such as slice-based, hierarchical, uniform and non-uniform strategies were discussed in [8]. The adaptive topologies discussed there were discrimination-based, perception-oriented, template-based and Voronoi-based. A combination of zoning topologies and parameter-based membership functions was used to recognize handwritten numerals from the CEDAR database. In [9] the authors reported a recognition rate of 92% for the CEDAR dataset, using a neural network classifier and a fuzzy zoning technique.
Radtke et al. [10] worked on the NIST handwritten digit dataset (Latin numerals), consisting of 50,000 training and 10,000 testing patterns. The features extracted from these patterns were contour-based information, concavities and pixel distribution. The best recognition rate reported was 95%, using a nearest neighbor classifier.

III. METHODOLOGY
The handwritten characters, written on A4 size paper documents, are scanned at 300 dpi. In the current study the database is developed for 50 classes (characters). Each character is written on paper in a rectangular box, in different sizes and styles, by 85 individual writers. The scanned documents are preprocessed to extract the characters used in the experiments. The document images are first binarized using Otsu's algorithm. The noise introduced by scanning is removed using morphological tools such as dilation and erosion. The characters are then extracted by applying the minimum bounding rectangle concept. In total, 3,750 samples (50 characters x 75 samples) were developed for training, whereas 500 samples (50 characters x 10 samples) were developed for testing. All the character images are then normalized to size MxM (M=50 in the current work) without changing the aspect ratio of the character.
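The preprocessing chain above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, ink is assumed to be darker than paper, morphological noise removal is omitted, and the shorter side of the normalized image is assumed to be zero-padded and centred so that the aspect ratio is preserved.

```python
def otsu_threshold(gray):
    """Return the Otsu threshold of a 2-D list of 8-bit grey levels."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their grey levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        between_var = w0 * w1 * (m0 - m1) ** 2  # between-class variance
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray):
    """Mark ink pixels as 1 (assumption: dark ink on light paper)."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]


def crop_to_bounding_box(binary):
    """Crop to the minimum bounding rectangle (needs at least one ink pixel)."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


def normalize(char, m=50):
    """Nearest-neighbour resize so the longer side is m, centred on m x m."""
    h, w = len(char), len(char[0])
    scale = m / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    top, left = (m - new_h) // 2, (m - new_w) // 2
    canvas = [[0] * m for _ in range(m)]
    for r in range(new_h):
        for c in range(new_w):
            src_r = min(h - 1, int(r / scale))
            src_c = min(w - 1, int(c / scale))
            canvas[top + r][left + c] = char[src_r][src_c]
    return canvas
```

A scanned box would be passed through `binarize`, `crop_to_bounding_box` and `normalize` in that order to obtain the 50x50 character image.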

A. Feature Extraction
The motivation behind feature extraction is to estimate the attributes that are most appropriate to represent a character [11]. Its primary objective is to maximize the recognition rate with a minimum number of features. The normalized raw character images are used to extract the useful features. Because handwriting is highly variable and imprecise, extracting such features is a difficult task [12]. In this work, the features are extracted from the character images by dividing them into smaller zones of size 10 × 10. Hence each character contains a total of 25 zones for a normalized image of size 50 × 50. From each zone the 4- and 8-directional pixel distributions are computed by superimposing 3 × 3 masks. The 4 and 8 directions considered are shown in Fig. 2(a) and (b) respectively. The number of features extracted for each character image is therefore 100 (25 zones × 4 directions) and 200 (25 zones × 8 directions) respectively. The most challenging task is to identify the relevant features that help to distinguish any pair of characters [13,14,15]. The relevant or optimum features are found using a fast search method, namely the genetic algorithm. The features selected by this search are used to classify the handwritten Telugu characters with a k-NN classifier.
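The zoning scheme can be illustrated with a short sketch. The exact 3 × 3 masks of Fig. 2 are not reproduced in the text, so the direction sets below are assumptions: the 8 directions are taken as the eight neighbours of a pixel, and the 4 directions as 0°, 45°, 90° and 135°. For each zone and each direction, the sketch counts the ink pixels whose neighbour in that direction is also ink. The names `DIRS_4`, `DIRS_8` and `directional_features` are hypothetical.

```python
# (row offset, column offset) for each direction; assumed, see the lead-in.
DIRS_8 = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
          (0, -1), (1, -1), (1, 0), (1, 1)]
DIRS_4 = DIRS_8[:4]  # 0, 45, 90 and 135 degrees


def directional_features(img, dirs, zone=10):
    """Return one count per (zone, direction) for a square binary image."""
    m = len(img)
    feats = []
    for zr in range(0, m, zone):
        for zc in range(0, m, zone):
            counts = [0] * len(dirs)
            for r in range(zr, zr + zone):
                for c in range(zc, zc + zone):
                    if not img[r][c]:
                        continue
                    # Centre the 3 x 3 mask on the ink pixel (r, c).
                    for k, (dr, dc) in enumerate(dirs):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < m and 0 <= cc < m and img[rr][cc]:
                            counts[k] += 1
            feats.extend(counts)
    return feats


# Example: a single horizontal stroke across a 50 x 50 image.
image = [[0] * 50 for _ in range(50)]
image[25] = [1] * 50
f4 = directional_features(image, DIRS_4)  # 25 zones x 4 directions
f8 = directional_features(image, DIRS_8)  # 25 zones x 8 directions
```

With these assumptions the feature vectors have lengths 100 and 200, matching the dimensions stated above.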

B. Genetic algorithm
The genetic algorithm (GA) is a search method that finds useful solutions over successive generations. Initially a population H of size D (the desired number of features) is generated randomly. Each solution is represented as a binary string. These random solutions are evaluated with the two proposed fitness measures. The population is evaluated over g generations. The best-fit solutions are selected as parents for the next generation and are used to reproduce the new population through crossover and mutation. The primary steps involved in the GA to generate a new solution are as follows:
Selection: Based on the computed fitness measure, the fit solutions survive the selection step. The unfit solutions make space for the new solutions in the next iteration/generation.
Crossover: A new solution is produced by crossing two fit solutions. A bit-wise logical AND is performed between the two solutions to generate the new one.
Mutation: One of the bits in the binary string obtained from the crossover step is mutated to produce a new offspring.
The algorithm terminates when the maximum number of generations is reached. In every generation the worst solution is discarded to make room for the offspring produced in the next generation. The newly generated population is again evaluated using the fitness function. The best solution, containing a subset of the features, is used to classify the handwritten Telugu characters with the k-NN classifier.
The steps involved in the genetic algorithm are as follows:
1. Initialize the population with random solutions.
2. Evaluate the population using the fitness function.
3. Terminate if the maximum number of generations is reached, else go to step 4.
4. Reproduce a new population by crossover and mutation.
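The loop described by these steps can be sketched as below. It is a minimal sketch under stated assumptions: the two fittest solutions are taken as parents, exactly one offspring is produced per generation, and the fitness function is supplied by the caller. The function `run_ga`, the toy fitness and the target mask are hypothetical and only serve to make the example runnable.

```python
import random


def run_ga(population, fitness, generations, rng):
    """Evolve a list of equal-length bit lists and return the fittest one."""
    population = [list(sol) for sol in population]  # work on a copy
    n_bits = len(population[0])
    for _ in range(generations):
        # Evaluate and rank: fittest first, worst last.
        population.sort(key=fitness, reverse=True)
        parent_a, parent_b = population[0], population[1]
        # Crossover: bit-wise logical AND of two fit solutions.
        child = [a & b for a, b in zip(parent_a, parent_b)]
        # Mutation: flip one randomly chosen bit of the child.
        child[rng.randrange(n_bits)] ^= 1
        # The worst solution makes room for the offspring.
        population[-1] = child
    return max(population, key=fitness)


# Toy usage: the fitness rewards agreement with a hypothetical target mask.
rng = random.Random(0)
target = [1, 0, 1, 1, 0, 0, 1, 0]
toy_fitness = lambda sol: sum(1 for a, b in zip(sol, target) if a == b)
start = [[rng.randint(0, 1) for _ in target] for _ in range(6)]
best = run_ga(start, toy_fitness, generations=30, rng=rng)
```

Because only the worst solution is replaced in each generation, the best fitness in the population never decreases. Note that AND crossover can only clear bits, so the single-bit mutation is the only operator that can switch a feature back on.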

C. Proposed fitness measures
The solutions in the population survive into the next generation based on their computed fitness. A genetic algorithm is generally suited to maximization problems; however, minimization problems can also be solved by applying a suitable transformation. The control parameters of the genetic algorithm used in the proposed work are tabulated in Table 1. The fitness measures used in the current work to evaluate the population in every generation are the Hamming distance and the cross correlation coefficient. These are described as follows:
Hamming distance: The Hamming distance is the number of positions at which the bits in two binary strings differ. It is computed by taking the XOR of two strings/solutions x and y of the genetic algorithm, with bits x_i and y_i, as depicted in Equation (1).

$$HD(x, y) = \sum_{i} x_i \oplus y_i \tag{1}$$

The smaller the Hamming distance, the better the solution in the whole population. For the maximization problem of the genetic algorithm, the function HD is modified as in Equation (2).
Cross correlation coefficient: The number of positions at which the bits in two strings are the same is the degree of similarity between them, given by Equation (3). The cross correlation coefficient is given by Equation (4), where n_x and n_y are the number of cells occupied by the solutions x and y respectively. The largest value in the entire population is preferred (maximization problem).
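The two bit-level quantities described in this subsection can be checked on a small example. The sketch below computes the Hamming distance by XOR, the number of agreeing positions, and the number of occupied cells (bits set to 1, an assumed reading of "cells occupied"). It deliberately stops short of the maximization transform and of the normalization used in the cross correlation coefficient, since those formulas are not reproduced in the text. Function names are hypothetical.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(a ^ b for a, b in zip(x, y))


def matching_positions(x, y):
    """Number of positions at which two equal-length bit lists agree."""
    return sum(1 for a, b in zip(x, y) if a == b)


def occupied_cells(x):
    """Number of bits set to 1, i.e. features selected by the solution."""
    return sum(x)


x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 0, 1]
hd = hamming_distance(x, y)        # 3: positions 1, 2 and 5 differ
same = matching_positions(x, y)    # 3: positions 0, 3 and 4 agree
n_x, n_y = occupied_cells(x), occupied_cells(y)  # 3 and 4
```

For strings of length n the two counts always sum to n, so a small Hamming distance and a large number of matching positions describe the same pair of solutions from opposite directions.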

IV. RESULTS AND DISCUSSIONS
The genetic algorithm is started with an initial population whose members are uniformly distributed in the range [L_b, U_b]. The lower bound (L_b) is set to 1 and the upper bound (U_b) is set to n_f (the total number of features). For the first generation, the features are selected randomly. The positions of the selected features are set to logic 1 and those of the features that are not selected are set to logic 0. From the second generation onwards, the solutions in the population are modified according to the steps of the technique discussed in the previous section.
As the optimum subset size cannot be anticipated, the desired number of features/dimension size (D) is allowed to vary in the simulations to identify it. The maximum subset size is set to n_f/2 (half the size of the original feature vector) so that the feature vector is reduced by at least 50%.
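The initialization and the sweep over D can be sketched as follows. This is an illustration only: the step between candidate subset sizes is an assumed value, and the function names are hypothetical.

```python
import random


def random_feature_mask(n_f, d, rng):
    """Select d distinct feature positions in [1, n_f] and set them to 1."""
    positions = rng.sample(range(1, n_f + 1), d)  # L_b = 1, U_b = n_f
    mask = [0] * n_f
    for p in positions:
        mask[p - 1] = 1  # positions are 1-based, list indices 0-based
    return mask


def candidate_subset_sizes(n_f, step=5):
    """Subset sizes D to try, capped at n_f / 2 (step is an assumption)."""
    return list(range(step, n_f // 2 + 1, step))


rng = random.Random(1)
sizes = candidate_subset_sizes(100)        # 5, 10, ..., 50 for n_f = 100
mask = random_feature_mask(100, sizes[-1], rng)
```

Each value of D would seed a separate GA run, and the subset size giving the best recognition rate is retained.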
The k-NN classifier is trained only with the optimum features derived from the optimization technique to classify the handwritten Telugu characters. The optimization results obtained for the 4- and 8-directional pixel distribution feature sets are shown in Figs. 3 and 4, respectively.
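Classification with a selected feature subset can be sketched as below. The value of k and the distance metric are not stated in the text, so k = 3 and squared Euclidean distance are assumptions, and the data in the usage example are made up for illustration.

```python
from collections import Counter


def select(features, mask):
    """Keep only the features whose mask bit is 1."""
    return [f for f, keep in zip(features, mask) if keep]


def knn_predict(train_x, train_y, sample, mask, k=3):
    """Majority vote among the k nearest training samples."""
    s = select(sample, mask)
    dists = []
    for x, label in zip(train_x, train_y):
        xs = select(x, mask)
        d2 = sum((a - b) ** 2 for a, b in zip(xs, s))  # squared Euclidean
        dists.append((d2, label))
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy usage: the third feature is noise and is switched off by the mask.
train_x = [[0, 0, 9], [1, 0, 9], [0, 1, 0], [9, 9, 0], [8, 9, 9], [9, 8, 5]]
train_y = ["A", "A", "A", "B", "B", "B"]
mask = [1, 1, 0]
pred_a = knn_predict(train_x, train_y, [1, 1, 7], mask)
pred_b = knn_predict(train_x, train_y, [8, 8, 0], mask)
```

Only the masked features need to be stored for the training set, which is where the memory saving from feature selection comes from.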
The best results obtained with the two fitness measures are tabulated in Table 2, together with the optimum number of features needed to recognize the handwritten Telugu characters.
The results obtained with the cross correlation fitness measure are better than those obtained with the Hamming distance measure. Because the GA search with the proposed fitness measures retains only a subset of the features, the memory needed to store the features for recognizing the handwritten Telugu characters is also reduced.

V. CONCLUSION
In this work, investigations were carried out to recognize handwritten Telugu characters by extracting 4- and 8-directional features. Two fitness measures, namely cross correlation and Hamming distance, were used to evaluate the population in the genetic algorithm. The optimum solution (subset of features) obtained from the GA is used to recognize the handwritten Telugu characters. The best recognition rate obtained is 89.08% with an optimum subset of 85 features. In future the work can be extended by employing other feature extraction and search methods.