REMOVAL OF DUPLICATES IN DATABASE RELATIONS AND THE ASSOCIATED PROPAGATION MANAGEMENT

Removing duplicate records from the relations of a database is an essential operation and a critical step in data integration. If record duplication is left unmanaged or is mismanaged, it degrades the quality, consistency, and integrity of the data. This paper reviews the contexts in which data duplication arises and the techniques available for managing it. It also proposes improved techniques for dealing with data duplication: a set of data fusion techniques, and a new data propagation method that follows the fusion result in order to maintain data consistency.


INTRODUCTION
Data duplication is common in the relations of a relational database. If it is not managed early, the problem becomes worse once many referenced relations have been formed.
The duplicate record problem is also called the entity resolution or record linkage problem. Within a single relation, the same real-world object may be represented by multiple descriptions, which makes the properties of the object hard to interpret. Multiple descriptions of the same object arise from missing data, data modification, data deletion, typographical errors, and failure to follow standard rules and procedures in data manipulation operations. There is no standard, generalized framework for accurate and effective data manipulation in databases. Nor is there a standard, deterministic methodology for finding duplicate tuples in a relation without relying on the primary key to identify duplicate descriptions.
Duplication arises in two forms. Partial duplication occurs when the data is duplicated for only a subset of the attributes of a tuple; it can be removed by simple database operations. Full duplication occurs when the data is duplicated for the entire record. Full duplication between two tuples of the same relation is never permitted intentionally, but it can occur inadvertently. Whenever it occurs, it must be identified and the duplicates must be replaced with the best consolidated tuple, which should match all the duplicated tuples as closely as possible. For simplicity, the set of duplicated tuples is called the bad tuple set; it contains two or more tuples. The entire bad tuple set must be replaced with one better tuple that is more than 90% similar to every tuple in the bad tuple set.
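The 90% rule above can be illustrated with a short Python sketch. The text does not say how similarity is measured, so the measure used here (the mean of per-attribute string similarity ratios from the standard `difflib` module) is an assumption, and the function names `tuple_similarity` and `is_acceptable_replacement` are hypothetical.

```python
from difflib import SequenceMatcher


def tuple_similarity(a, b):
    """Mean per-attribute string similarity of two equal-length tuples.

    Assumption: the paper does not define the similarity measure, so
    difflib's ratio (a value in [0, 1]) is used purely for illustration.
    """
    ratios = [SequenceMatcher(None, str(x), str(y)).ratio() for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def is_acceptable_replacement(candidate, bad_tuple_set, threshold=0.9):
    """True if the candidate is more than `threshold` similar to every bad tuple."""
    return all(tuple_similarity(candidate, t) > threshold for t in bad_tuple_set)


# Hypothetical bad tuple set: two descriptions of the same person.
bad_tuple_set = [
    ("John Smith", "12 Main Street", "Hyderabad"),
    ("Jon Smith", "12 Main Street", "Hyderabad"),
]
good_candidate = ("John Smith", "12 Main Street", "Hyderabad")
poor_candidate = ("Mary Jones", "99 Lake Road", "Chennai")

print(is_acceptable_replacement(good_candidate, bad_tuple_set))
print(is_acceptable_replacement(poor_candidate, bad_tuple_set))
```

Any other similarity measure could be substituted for `tuple_similarity` without changing the threshold check.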

LITERATURE REVIEW
Data duplication generally occurs because of manual errors and mistaken assumptions. Database management systems already provide classical tools that bear on the problem, such as null, not null, default, on delete cascade, on update cascade, and restrict controls, and many researchers have identified these as one solution for controlling referential integrity. These techniques control referential integrity, but they do not ensure the semantic correctness of the relationships in the relations of the database after a data fusion operation completes. A second solution, aimed at managing relationships semantically, is to use generalized, semantic versions of the existing referential integrity management techniques. Other, more intelligent management techniques have also been proposed in the literature. M.A. Hernández and S.J. Stolfo stated that, in the database community, the record linkage or record duplication problem is described as merge-purge [7]. Ahmed K. Elmagarmid et al. [1] described duplicate record detection as the process of finding multiple, differing records that refer to one unique real-world entity or object. The authors also reported implementing a variety of string similarity metrics for duplicate record detection, such as Jaro, edit distance, and q-gram distance. Ravikumar and Cohen [8] follow a similar approach and proposed a hierarchical graphical model for learning matched record pairs. B. Zhao et al. [2] proposed a Bayesian approach to data fusion that learns the quality of the data sources and incorporates the learned knowledge into the fusion operation. Because of its stateless nature, the approach is not well suited to the online setting.
Web applications commonly require duplicate-free data and an error-free representation of records. The former is achieved through record linkage (RL) and the latter through data fusion. Both are well-studied problems [1] [5]. Although significant effort has been devoted to solving them, very little work has been done on applying them at query execution time. Hotham Altwaijry et al. [3] noted that efficiency, scalability, performance, and data quality are the main challenges of entity resolution, and that entity resolution can be computationally expensive. Hairong Dong and David Evans [4] defined data fusion as a formal framework that expresses the means and tools for the alliance of data originating from diverse sources. It aims at obtaining information of greater quality; the exact definition of 'greater quality' depends on the application. Kamakshi Lakshminarayan et al. [6] explored the use of machine learning for data imputation when dealing with missing data. The authors proposed two well-known machine learning techniques. The first is data clustering, an unsupervised learning strategy that uses a Bayesian approach to cluster the data into classes; the resulting clusters are used to predict multiple candidate values for the attribute of interest. The second is a supervised learning technique that models the missing variables by inducing a decision-tree-based classifier, which predicts the most likely value for the attribute of interest. Empirical tests comparing the two techniques showed that both are useful and that both have limitations. Verykios et al. [10] proposed a set of techniques for reducing the complexity of record comparison.
Sarawagi and Bhamidipaty [9] designed ALIAS, a learning-based duplicate detection system that uses the idea of a "reject region".
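Two of the string similarity metrics named above, edit distance and q-gram distance, can be sketched in a few lines of Python. This is a minimal illustration using their textbook definitions, not the implementation from [1]; the function names and the default `q=2` are choices made here.

```python
from collections import Counter


def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,         # delete cs
                            curr[j - 1] + 1,     # insert ct
                            prev[j - 1] + cost)) # substitute or match
        prev = curr
    return prev[-1]


def qgram_distance(s, t, q=2):
    """Sum of absolute differences between the q-gram counts of s and t
    (no padding characters are added at the string boundaries)."""
    grams_s = Counter(s[i:i + q] for i in range(len(s) - q + 1))
    grams_t = Counter(t[i:i + q] for i in range(len(t) - q + 1))
    return sum(abs(grams_s[g] - grams_t[g]) for g in grams_s.keys() | grams_t.keys())


print(edit_distance("kitten", "sitting"))   # 3
print(qgram_distance("abcd", "abce"))       # 2
```

Small distances under either metric suggest that two field values may describe the same real-world entity; a deployed detector would normally combine several fields and metrics.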
From the literature it is evident that there exist different types of data fusion functions, such as:
1. First order fusion function
2. Second order fusion function
3. Join fusion function
4. Set oriented fusion function
5. f-optimal fusion function
6. f-value functions
7. Random fusion functions
8. Maximal coherent fusion functions

PROBLEM CONTEXT
Incomplete data are everywhere in data sources; as a result, the available data are inefficient and often biased. Database modifications sometimes result in record duplication in the relations, and data duplication is always a challenging situation. The identification, removal, and replacement of undesired tuples in the relations of a database is called the data fusion operation. The techniques currently available are not sufficient for implementing data fusion effectively, and new techniques that work at a high semantic level are needed. Assume that there is a parent relation and one or more referenced relations, and that the parent relation contains duplicate tuples. Generally, data fusion is performed as a first step and data propagation as a second step. In the first step, duplicate tuples are identified and replaced with correct tuples; in the second step, the modified details are propagated from the parent relation to the referenced relations. Fusion functions are used in the data fusion step. Data propagation may be either backward or forward. On delete cascade is one solution for maintaining referential integrity in database operations, but it introduces randomness into the choice of which tuples are deleted and which are kept as they are. As a result, there is no guarantee that data quality and the semantic correctness of relationships are maintained after the data fusion operation completes.

Objectives
For effective database management, duplicate tuples must be identified using a more generalized framework, and two or more incorrect tuples must then be replaced with one or more correct tuples. The database must satisfy the data consistency property both before and after modifications. It must also satisfy its data integrity constraints, in particular its referential integrity constraints, both before and after modifications. Many techniques exist for controlling and managing referential integrity constraints, such as set null, set not null, on delete cascade, on update cascade, and restrict. All of these are specialized techniques; none is a generalized technique for propagating database updates and deletions through a high-level semantic procedure, and none provides quality relationship management in a semantic way. Modern very large database management systems need optimized, high-quality data relationships among the relations of a database. The present study proposes a new semantics-based framework for efficient, accurate, and effective management of relationship quality in database operations. The new data fusion technique is independent of the method used to find duplicate records, so it can be applied to very large and varied data fusion operations as an independent, semantic, generalized, and scalable approach in the domain of SQL data management. A running example is used in this paper to explain the data fusion operation on linked relations, with the aim of preserving both referential integrity and the semantic correctness of the relationships in the relations of a specific database.
The proposed data propagation algorithm is well defined. It manages, controls, and coordinates the net impact of a fusion operation on joined relations with respect to both the preservation of referential integrity and the semantic correctness of the linked relationships after the data fusion operation completes. The algorithm maintains referential integrity after duplicate tuples are fused in the parent relation of the selected database, and it uses a standard framework of fusion functions that operate on multi-valued data.

The approach
The proposed methodology fuses a set of duplicate records into one record in order to keep the database consistent under modification. Three strategies are developed for this. In the first strategy, the fusion function uses the attribute union; the union is applied either attribute value by attribute value or record by record, whichever is convenient or possible. In the second strategy, the attribute mean is used to fill in the missing value of an attribute so that the records can be fused; this is called mean imputation in machine learning terminology. In the third strategy, the majority value of the attribute is used to fill in the missing value, which is particularly useful when the values of the attribute are categorical or discrete.
The first strategy is explained with the following worked example. A database consisting of three relations is considered. The three relations are Establishment, Entrance, and Entrance details, shown in TABLE-1, TABLE-2, and TABLE-3 respectively. For simplicity, only a limited set of missing values is considered, and the fusion operation involving union is then applied.
To remove duplicate tuples from the parent relation, a fusion function is used. A first order fusion function takes a set of duplicate tuples and maps them to one correct tuple. If the first order fusion function holds only for a subset of the attributes, it is called a partial preservation fusion function; if it holds for all the attributes, it is called a full preservation, or simply preservation, fusion function. A second order fusion function operates on multi-valued data: it takes multiple sets of duplicate tuples from the parent relation and replaces them with one particular correct set of tuples.
In the literature, a second order fusion function is also called a multi-valued fusion function; it replaces sets of input duplicate tuples with one particular, correct set of tuples.
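The attribute-union strategy can be sketched as a first order fusion function in Python. The sketch below is an illustration under stated assumptions, not the paper's exact procedure: tuples are represented as dictionaries, a missing value is `None`, and when the duplicates disagree on an attribute the distinct values are kept as a list (a multi-valued result). The function name `union_fuse` and the sample attribute names are hypothetical.

```python
def union_fuse(duplicates):
    """Fuse duplicate tuples (dicts) attribute by attribute using union.

    Assumptions for illustration only: None marks a missing value, and
    conflicting non-missing values are returned together as a list.
    """
    fused = {}
    # Collect attribute names in first-seen order across all duplicates.
    attributes = dict.fromkeys(attr for t in duplicates for attr in t)
    for attr in attributes:
        # Distinct non-missing values, in order of first appearance.
        values = list(dict.fromkeys(
            t[attr] for t in duplicates if t.get(attr) is not None
        ))
        if not values:
            fused[attr] = None          # missing in every duplicate
        elif len(values) == 1:
            fused[attr] = values[0]     # all known values agree
        else:
            fused[attr] = values        # conflict: keep the union
    return fused


# Hypothetical duplicates of one establishment, each missing one attribute.
duplicates = [
    {"est_id": 1, "name": "City Hall", "city": None},
    {"est_id": 1, "name": None, "city": "Pune"},
]
print(union_fuse(duplicates))
# {'est_id': 1, 'name': 'City Hall', 'city': 'Pune'}
```

Keeping conflicting values as a list defers conflict resolution to a later step instead of silently discarding one of the values.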

Example database one
The second strategy fuses the duplicates by filling each missing value with the attribute mean; this is called the mean imputation technique. When the attribute values are categorical or nominal, the mode of the attribute is used to fill the missing value instead; this is called mode imputation. The strategy is explained with the following example table. In the table, if the first three records are assumed to be duplicates, the mean imputation strategy is applicable to the attribute "att2" and the mode imputation strategy is applicable to the attribute "Att3". The last two records can be fused similarly. The resulting fusion is shown in Table 8 and Table 9. The third strategy fuses the duplicates by filling each missing value with the majority attribute value; this is called the majority imputation technique, and it is almost identical to the mode imputation strategy.
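Mean and mode imputation within a group of duplicates can be sketched as follows. The records are invented for illustration because the original table is not reproduced here, and the sketch assumes that the mean and mode are computed over the non-missing values inside the duplicate group; the text does not say whether the group or the whole column is used. The function name `impute_group` is hypothetical.

```python
from collections import Counter
from statistics import mean


def impute_group(records, numeric_attrs, categorical_attrs):
    """Return copies of the records with missing values (None) filled.

    Numeric attributes get the mean of the observed values in the group;
    categorical attributes get the mode (the majority value).
    """
    filled = [dict(r) for r in records]      # leave the input untouched
    for attr in numeric_attrs:
        observed = [r[attr] for r in records if r[attr] is not None]
        if observed:
            fill = mean(observed)
            for r in filled:
                if r[attr] is None:
                    r[attr] = fill
    for attr in categorical_attrs:
        observed = [r[attr] for r in records if r[attr] is not None]
        if observed:
            fill = Counter(observed).most_common(1)[0][0]
            for r in filled:
                if r[attr] is None:
                    r[attr] = fill
    return filled


# Hypothetical duplicate group: att2 is numeric, att3 is categorical.
group = [
    {"att1": "x", "att2": 10.0, "att3": "red"},
    {"att1": "x", "att2": None, "att3": "red"},
    {"att1": "x", "att2": 20.0, "att3": None},
]
for row in impute_group(group, ["att2"], ["att3"]):
    print(row)
```

In this sketch the missing `att2` value becomes 15.0 and the missing `att3` value becomes "red". Majority imputation corresponds to the categorical branch.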

ALGORITHM
The algorithm used for data fusion and associated propagation is presented here.

Proposed Algorithm
Algorithm Data-Fusion-Propagation

Algorithm description
Steps 1 and 2: for each tuple t in the duplicate set D, the set of linked tuples in R* is constructed and stored in the set S_t.
Steps 3, 4, 5 and 6: the tuples in S_t are assigned to * after the foreign key is replaced with the primary key.
Step 8: the set of tuples projected with respect to the primary key is stored in S-projected.
Step 9: the tuples that result from applying the fusion function to S-projected are stored in B.
Step 10: for all tuples in B, the non-key attributes are filtered.
Steps 11 to 14: any conflicts are resolved and the final result is stored in R*.
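The general idea of propagating a fusion result to a referenced relation can be sketched in Python. This is not the Data-Fusion-Propagation algorithm itself, whose steps are only summarized above; it is a simplified illustration that assumes a single parent relation, a single child relation, and a fused tuple that has already been computed. The function name, the column names `id` and `parent_id`, and the sample data are all hypothetical.

```python
def propagate_fusion(parent, child, duplicate_keys, fused_tuple,
                     pk="id", fk="parent_id"):
    """Replace duplicate parent tuples with the fused tuple and re-point
    the child tuples that referenced them, dropping exact child duplicates."""
    new_key = fused_tuple[pk]
    # Parent relation: remove the duplicates, then add the fused tuple.
    new_parent = [t for t in parent if t[pk] not in duplicate_keys]
    new_parent.append(fused_tuple)

    new_child, seen = [], set()
    for row in child:
        row = dict(row)                       # copy; do not mutate the input
        if row[fk] in duplicate_keys:
            row[fk] = new_key                 # forward propagation of the key
        signature = tuple(sorted(row.items()))
        if signature not in seen:             # re-pointing can create duplicates
            seen.add(signature)
            new_child.append(row)
    return new_parent, new_child


parent = [
    {"id": 1, "name": "City Hall"},
    {"id": 2, "name": "City Hal"},            # duplicate of id 1
    {"id": 3, "name": "Museum"},
]
child = [
    {"entrance": "North", "parent_id": 1},
    {"entrance": "North", "parent_id": 2},
    {"entrance": "East", "parent_id": 3},
]
new_parent, new_child = propagate_fusion(
    parent, child, duplicate_keys={1, 2},
    fused_tuple={"id": 1, "name": "City Hall"},
)
print(new_parent)
print(new_child)
```

Unlike on delete cascade, no child tuple is deleted merely because its parent was a duplicate; each child tuple is re-attached to the fused parent, so every foreign key still refers to an existing primary key.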

CONCLUSION
This paper identifies a new framework for removing duplicates from relations through data fusion, followed by data propagation to make the data consistent. The framework is semantically correct and is a more generalized version of traditional methods such as null, not null, on delete cascade, on update cascade, and restrict. A DBMS must enforce referential integrity constraints whenever tuples are deleted from the parent table. The new framework manages not only referential integrity problems but also the semantically related details of the modified data. The data fusion process uses data fusion functions, and this work proposes three strategies for the data fusion operation, each explained with a numerical example. There is scope in the future to find and use new fusion functions. The main disadvantage of forward data propagation is that the linked sets of tuples cannot be fused directly by a multi-valued fusion function. Data fusion and data propagation must be considered separately, and hence the memory required must be independent of the number of data propagations. Different multi-valued fusion functions give different accuracy; in general, the accuracy depends on the size of the set of duplicate tuples. Many techniques give better results than the on delete cascade operation, particularly when the duplicate dataset is very large. The data propagation technique improves data quality.