REVIEW ON PERFORMANCE IMPROVEMENT OF HETEROGENEOUS HADOOP CLUSTER USING RANKING ALGORITHM

Advances in technology have increased the use and growth of information technology, and the volume of data now grows every minute. Stored data must be processed to extract value from raw, imprecise data, which motivates parallel and distributed processing with Hadoop. Hadoop assumes that all nodes in a cluster are homogeneous, but this assumption often does not hold: in the cloud, machines with different configurations are presented as one logical cluster. A data placement policy is therefore used to distribute data according to the processing power of each node. The dynamic block placement strategy in Hadoop distributes input data blocks to nodes on the basis of their computing capacity. The proposed approach dynamically balances and reorganizes the input data in accordance with each node's capability in a heterogeneous environment, which reduces data transfer time and improves performance. It combines a block placement strategy, a page ranking algorithm and a sampling algorithm. The data placement strategy decreases execution time and improves the performance of heterogeneous clusters. Hadoop is used to handle big data, and applications on Hadoop that handle small files are considered so that their performance penalty on the Hadoop platform can be reduced. The proposed work shows improved performance.


I. INTRODUCTION
In 2006 [1], Doug Cutting and Mike Cafarella developed Hadoop, an open-source framework for storing and processing large datasets in a distributed environment. System failure and data loss can be reduced: the failure of any single node does not matter, because thousands of interconnected nodes are available. Large volumes of data, and their frequent transfer among the nodes, are handled by the framework. Hadoop is widely known for handling big data [2], and its popularity continues to grow. Several techniques are used to achieve better performance.

B. Big Data :
A massive collection of data is called Big Data. The data can take any form: structured or unstructured, relational or non-relational. Unstructured data includes audio, video, text, images and other data without a fixed pattern. In recent years, Big Data has become popular in many fields and represents a major opportunity for business. Large-scale transmission and communication produce large volumes of data from various sources, and data mining algorithms are required to process it. Earlier, the corporate world was responsible for most large-scale data production, but in recent years users themselves have become a major source of data.

C. Dynamic Block Placement Strategy :
The dynamic block placement strategy is considered for two kinds of cluster: homogeneous and heterogeneous.

1. Homogeneous Cluster :
In a homogeneous cluster, data is distributed among the nodes according to the space available on each node. Hadoop provides a utility called the Balancer, which rebalances the data before applications are run. The Balancer is important whenever a large amount of data accumulates on one node. Replication is also important, and the Balancer takes the replicas into account when it moves blocks. It is the key feature in data movement [5].
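The balancing idea can be illustrated with a minimal sketch. It is not the HDFS Balancer implementation; it only assumes that a node counts as balanced when its utilization lies within a threshold of the cluster-wide utilization. The function name `classify_nodes`, the node names and the threshold of 10 percentage points are illustrative assumptions.

```python
def classify_nodes(used, capacity, threshold=10.0):
    """Label each node as "over", "under" or "balanced".

    used, capacity: dicts mapping node name -> storage in a common unit.
    threshold: allowed deviation from the cluster utilization,
               in percentage points (assumed value).
    """
    cluster_util = 100.0 * sum(used.values()) / sum(capacity.values())
    labels = {}
    for node in capacity:
        node_util = 100.0 * used[node] / capacity[node]
        if node_util > cluster_util + threshold:
            labels[node] = "over"       # candidate source of block moves
        elif node_util < cluster_util - threshold:
            labels[node] = "under"      # candidate target of block moves
        else:
            labels[node] = "balanced"
    return cluster_util, labels


# Hypothetical homogeneous cluster: three nodes of 100 units each.
used = {"n1": 90, "n2": 50, "n3": 40}
capacity = {"n1": 100, "n2": 100, "n3": 100}
cluster_util, labels = classify_nodes(used, capacity)
print(cluster_util, labels)
```

In this sketch, blocks would be moved from nodes labelled "over" to nodes labelled "under" until every node falls inside the threshold band.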

2. Heterogeneous Cluster :
In a heterogeneous environment, data often has to be transferred from one node to another. The faster nodes incur overhead from this processing and data transfer. This motivates a study of the issues that arise during data transfer, and a data placement policy addresses them. Implementing a data placement policy helps achieve better performance in the heterogeneous environment [5].
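A capacity-aware placement policy of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the policy of [5]: each node is given a relative computing capacity (the values below are hypothetical), and blocks are shared out in proportion to it, with leftover blocks going to the largest fractional remainders.

```python
def place_blocks(num_blocks, capacity):
    """Split num_blocks across nodes in proportion to their capacity."""
    total = sum(capacity.values())
    # Ideal (fractional) share of each node.
    ideal = {n: num_blocks * c / total for n, c in capacity.items()}
    # Start from the rounded-down share ...
    alloc = {n: int(share) for n, share in ideal.items()}
    # ... then hand the leftover blocks to the largest remainders.
    leftover = num_blocks - sum(alloc.values())
    by_remainder = sorted(capacity, key=lambda n: ideal[n] - alloc[n],
                          reverse=True)
    for n in by_remainder[:leftover]:
        alloc[n] += 1
    return alloc


# Hypothetical capacity ratios for three node types.
capacity = {"slow": 1, "medium": 2, "fast": 3}
alloc = place_blocks(12, capacity)
print(alloc)
```

With such a placement, a faster node holds more local blocks and has less need to fetch data from slower nodes during processing.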

II. RELATED WORK
Many authors have studied Big Data, how to handle it, and how it behaves on homogeneous and heterogeneous clusters. Jeffrey Dean et al. [1] describe Hadoop, which processes terabytes of data. Hadoop is written in Java and developed under the Apache project. It processes data in parallel across large clusters, is an attractive open-source framework, and processes and replicates data reliably. It is designed to run on commodity clusters; commodity computing favors many low-cost, low-performance machines working in parallel. HDFS, the file system of the Apache Hadoop project, is distributed, low-cost and highly fault-tolerant. It suits large datasets, provides high throughput, and is inexpensive to deploy. A single NameNode in each cluster maintains the file system metadata, while application data is stored on multiple DataNodes. MapReduce analyzes large datasets and is useful to many organizations. Machine learning, indexing, searching and mining are some MapReduce applications. Traditional SQL has been used to implement these applications. MapReduce also handles data transformation, parallelization, network communication and fault tolerance.
Andrew Wang et al. [2] explain HDFS, a distributed file system that stores files redundantly across cluster nodes to protect the data. HDFS divides files into blocks and replicates each block according to the replication factor. The default block placement policy in HDFS distributes the blocks across the cluster nodes. Conditions such as unnecessary load on part of the cluster can arise at any time and reduce the overall performance of the cluster.
Konstantin Shvachko et al. [3] discuss block placement, which plays a vital role in performance and data reliability. The data block placement strategy improves reliability, availability and network utilization. When a new block is created, the first replica is placed on the node where the writer is located. The other replicas are placed randomly on different nodes, with the constraint that no more than two replicas are placed in a single rack. The NameNode chooses the DataNodes that host the replicas, which helps reduce network traffic and improves performance.
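The replica placement constraints summarized above can be sketched as follows. This is a simplified illustration, not the HDFS implementation: the function `place_replicas`, the node names and the rack names are hypothetical, and the only rules enforced are one replica per node and at most two replicas per rack.

```python
import random


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block.

    topology: dict mapping node name -> rack name.
    The first replica goes to the writer's node. The remaining replicas
    are drawn at random, skipping racks that already hold two replicas.
    Fewer than `replication` nodes are returned if the constraints
    cannot be met.
    """
    chosen = [writer]
    per_rack = {topology[writer]: 1}
    candidates = [n for n in topology if n != writer]
    rng.shuffle(candidates)
    for node in candidates:
        if len(chosen) == replication:
            break
        rack = topology[node]
        if per_rack.get(rack, 0) < 2:
            chosen.append(node)
            per_rack[rack] = per_rack.get(rack, 0) + 1
    return chosen


# Hypothetical cluster: two racks with three DataNodes each.
topology = {"a": "rack1", "b": "rack1", "c": "rack1",
            "d": "rack2", "e": "rack2", "f": "rack2"}
replicas = place_replicas("a", topology)
print(replicas)
```

Capping the number of replicas per rack means that the loss of one rack cannot remove every copy of a block.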
Fang Zhou et al. [4] describe how the application master generates input splits in a Hadoop MapReduce job. One input split is generated for each small file. One map container can process only one input split, so the number of map containers equals the number of input splits. This is a problem because a workload of many small files creates many map containers. Each container requires its own process, which results in considerable overhead. Similar overhead is incurred for reduce containers.
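The effect of small files on the number of map containers can be illustrated with a short sketch. The split size of 128 MB, the file sizes and both function names are assumptions made for illustration. The packing step is only loosely analogous to what Hadoop's CombineFileInputFormat does, and it assumes that every file is smaller than one split.

```python
import math

SPLIT_MB = 128  # assumed split size


def splits_per_file(sizes_mb, split_mb=SPLIT_MB):
    """One split per started block of each file.

    Even a very small file costs at least one split, and therefore
    one map container.
    """
    return sum(max(1, math.ceil(s / split_mb)) for s in sizes_mb)


def splits_combined(sizes_mb, split_mb=SPLIT_MB):
    """Greedily pack consecutive small files into splits of at most split_mb."""
    splits, current = 0, 0
    for s in sizes_mb:
        if current and current + s > split_mb:
            splits += 1      # close the split that is full
            current = 0
        current += s
    return splits + (1 if current else 0)


# Hypothetical workload: 1000 files of 1 MB each.
sizes = [1] * 1000
separate = splits_per_file(sizes)
combined = splits_combined(sizes)
print(separate, combined)
```

Because each split is served by its own map container, packing the files reduces the number of containers, and with it the process start-up overhead, by roughly two orders of magnitude in this example.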

III. PROBLEM DOMAIN
Hadoop handles Big Data, which has become difficult to manage because new data is generated every minute. Big Data is also an opportunity for business. Here, however, we focus on the issue of homogeneous and heterogeneous clusters. In a homogeneous cluster the nodes are similar, but the load allotted to each node is not. Overloading any node decreases the performance of that node and also reduces the overall performance. The situation is similar in a heterogeneous cluster [6], where the data nodes are of different sizes and Hadoop distributes an equal amount of load to every node. If a larger node receives a comparatively low load, it completes its work first but must then wait for the other nodes to complete their work, because the results of all nodes are combined after processing. This waiting reduces performance and increases computation time.
The problem arises as follows. Suppose the cluster nodes have different sizes, for example 2 GB, 4 GB, 6 GB and 8 GB. With these four nodes, the master node divides the total load by the number of nodes, that is, by four, so each node receives the same share regardless of its size.
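A small worked example makes the cost of equal division concrete. It rests on assumptions that go beyond the text: the load is 20 blocks, the processing speed of a node is proportional to its size in GB, and the job finishes only when the slowest node finishes.

```python
def finish_times(blocks, speed):
    """Time each node needs, assuming time = blocks assigned / relative speed."""
    return {n: blocks[n] / speed[n] for n in speed}


# Relative speeds assumed proportional to node size (2, 4, 6 and 8 GB).
speed = {"node2GB": 2, "node4GB": 4, "node6GB": 6, "node8GB": 8}
total_blocks = 20  # hypothetical job size

# Equal division: the master divides the load by the number of nodes.
equal = {n: total_blocks // len(speed) for n in speed}

# Size-aware division: each node's share is proportional to its speed.
proportional = {n: total_blocks * s // sum(speed.values())
                for n, s in speed.items()}

# The job completes when the last node completes.
equal_makespan = max(finish_times(equal, speed).values())
prop_makespan = max(finish_times(proportional, speed).values())
print(equal_makespan, prop_makespan)
```

Under equal division the 8 GB node finishes its five blocks in a quarter of the time the 2 GB node needs and then waits. Under the size-aware division all four nodes finish together, and the completion time falls from 2.5 to 1.0 time units.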
In this mitigation approach, we address the issues of performance and computation time [8]; the improvement is achieved in our solution domain.

IV. SOLUTION DOMAIN
The solution to the above problem uses a page ranking algorithm and a sampling algorithm. The page ranking algorithm works on the basis of frequency: the item with the higher frequency runs first. Page ranking is used to assign ranks, and the rank is allotted on the basis of frequency. Weight and frequency are calculated using the page ranking algorithm, which we use in this work.
Here, a heterogeneous cluster [7] is used, and its performance is increased using the ranking algorithm, which depends on the frequency of occurrence: the item whose frequency is highest runs first. The sampling algorithm selects nodes at random instead of enumerating all possible samples. The selection probabilities sum to the sample size n. This results in an increase in performance [8], a reduction in computation time, and an even distribution of the overall load across all the DataNodes.
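The two ingredients can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the ranking is a plain sort by frequency with normalized weights (it is not the full PageRank computation), the sampling is simple random sampling without replacement, and the node names and frequency counts are hypothetical.

```python
import random


def rank_by_frequency(frequency):
    """Return (node, weight) pairs, highest frequency first.

    The weight is the node's share of the total frequency,
    so the weights sum to 1.
    """
    total = sum(frequency.values())
    ranked = sorted(frequency.items(), key=lambda kv: kv[1], reverse=True)
    return [(node, count / total) for node, count in ranked]


def sample_nodes(nodes, n, seed=None):
    """Simple random sample of n nodes, drawn without replacement."""
    return random.Random(seed).sample(list(nodes), n)


# Hypothetical frequency counts per DataNode.
frequency = {"dn1": 4, "dn2": 10, "dn3": 6}
ranking = rank_by_frequency(frequency)

n = 2
picked = sample_nodes(frequency, n, seed=0)

# Under simple random sampling each node is included with probability n / N,
# so the inclusion probabilities sum to the sample size n.
inclusion = {node: n / len(frequency) for node in frequency}
print(ranking, picked, sum(inclusion.values()))
```

The node at the head of `ranking` would be scheduled first, and the sample gives a cheap estimate of the cluster's behavior without inspecting every node.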
V. CONCLUSION
The mitigation approach concludes that data nodes, whether of different sizes or of the same size, should not be overloaded, and that each node should be assigned a share of the load that depends on its size, which increases performance. This paper attempts to improve the performance of a heterogeneous Hadoop cluster using a ranking algorithm that works on the basis of frequency: the item with the maximum frequency is executed first. This reduces computation time and improves performance.