Using Pig on Hadoop for Data Analysis in Bioinformatics

Smita Saxena

Abstract

Data storage, processing, and analysis are major components of bioinformatics. Apache Hadoop provides a distributed computing framework for processing large volumes of data. Pig is an Apache open source project that runs on the Hadoop platform and lets programmers write queries and scripts in its procedural dataflow language, Pig Latin, rather than writing MapReduce programs directly in Java. Pig provides many statements similar to SQL clauses, along with more advanced features. The Pig platform compiles statements and scripts into equivalent map and reduce tasks and submits them to Hadoop for execution. This makes it possible to process and analyze large biological datasets effectively and in a short time. In contrast to a traditional (R)DBMS, Pig can also work with unconstrained data, and the schema need not be constrained, consistent, or predefined.
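
To make the idea of compiling a dataflow script into map and reduce tasks concrete, the following is a minimal sketch in plain Python, not Pig or Hadoop code. It imitates, in a single process, how a short filter-group-count script might be split into a map phase, a shuffle, and a reduce phase. The sample records, field names, and function names are all assumptions made for illustration, and the real plan that Pig generates may differ.

```python
from collections import defaultdict

# Illustrative Pig Latin script that this sketch imitates (hypothetical
# file name and fields, not taken from the article):
#   seqs      = LOAD 'sequences.tsv' AS (id:chararray, organism:chararray, len:int);
#   long_seqs = FILTER seqs BY len > 100;
#   grouped   = GROUP long_seqs BY organism;
#   counts    = FOREACH grouped GENERATE group, COUNT(long_seqs);

# Hypothetical sample records: (sequence_id, organism, length).
RECORDS = [
    ("seq1", "E. coli", 150),
    ("seq2", "E. coli", 80),
    ("seq3", "H. sapiens", 300),
    ("seq4", "H. sapiens", 120),
    ("seq5", "S. cerevisiae", 95),
]


def map_phase(records, min_length):
    """Apply the filter and emit (organism, 1) pairs, as a map task might."""
    for _seq_id, organism, length in records:
        if length > min_length:
            yield organism, 1


def shuffle(pairs):
    """Collect emitted values by key, standing in for Hadoop's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, as a reduce task computing COUNT might."""
    return {key: sum(values) for key, values in grouped.items()}


def count_long_sequences(records, min_length=100):
    """Run the three phases in order, in one process."""
    return reduce_phase(shuffle(map_phase(records, min_length)))


if __name__ == "__main__":
    print(count_long_sequences(RECORDS))
```

On a real cluster the map and reduce steps would run in parallel across many nodes. The sketch only shows how the work is divided between the phases.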

Keywords: Bioinformatics, Hadoop, Pig, Pig Latin.
