xmlBLASTparser V1.1 — A PHP BASED NCBI BLAST XML OUTPUT PARSER

xmlBLASTparser is a lightweight PHP library that parses the XML-formatted output of an NCBI BLAST sequence alignment and renders it as an attractive web page. The biological database accession numbers in each sequence alignment hit are hyperlinked to their original source. Moreover, the hit IDs in the description summary are anchor-hyperlinked to the corresponding sequence alignment section. The xmlBLASTparser library can be easily embedded or integrated in a web page on the server side, using either the standalone NCBI BLAST software or the RESTful web service of NCBI BLAST. The output of xmlBLASTparser has the same look and feel as the online NCBI BLAST. xmlBLASTparser is freely available under the terms of the GNU General Public License version 3 (GPLv3) at https://github.com/AshokHub/xmlBLASTparser.


I. INTRODUCTION
Sequence alignment is one of the best-known and most widely used methods in bioinformatics and molecular biology, with a broad range of applications such as sequence analysis, phylogenetic analysis, homology modeling, structural motif prediction, domain prediction, molecular fingerprint prediction, pattern matching, function prediction, genome fragment assembly, SNP analysis, and biodata integration. BLAST is a popular local sequence alignment tool originally developed by Stephen Altschul and colleagues at the National Institutes of Health (NIH) in 1990 [1][2][3]. In response to demand from life scientists, NCBI has released a variety of BLAST programs. BLAST programs and related tools include BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX, IgBLAST, SmartBLAST, BLAT, MOLE-BLAST, WU BLAST, PSI-BLAST, PHI-BLAST, MegaBLAST, DELTA-BLAST, RPS BLAST, AB BLAST, CaBLAST, Parcel BLAST, BLASTZ, VecScreen, CDART, CD-search, GEO, and Primer-BLAST, among others [4]. Moreover, NCBI offers its service through several modes, such as Web BLAST, stand-alone command-line BLAST, WWW BLAST, Cloud BLAST, BLAST URL API, Remote BLAST+, and C++ BLAST API [5].
With the NCBI BLAST tool, three types of nucleotide or protein sequence comparison can be performed: (i) pairwise sequence comparison, (ii) a query sequence against a set of sequences (local database), and (iii) a query sequence against a large set of sequences from external biological databases. The commonly used databases for sequence comparison are NR, ENV NR, NCBI GenBank/RefSeq, DDBJ, RCSB PDB, SwissProt/UniProt, EBI EMBL, PIR, PRF, PAT, EST, and dbSTS. NCBI BLAST delivers the sequence alignment result in various file formats, including ASN.1 (Text), ASN.1 (Binary), and Hit Table, among others [4,5]. xmlBLASTparser is a small PHP program that parses the output of an NCBI BLAST sequence alignment and generates a rich web page with a well-formatted alignment.

II. METHODS
The output files of an NCBI BLAST sequence alignment come in formats suited to particular programming languages and can be easily parsed for various sequence analyses. Most server-side applications have their own interactive graphical wrapper to execute the sequence alignment and retrieve its output. BLASTphp is a simple PHP library used to wrap or embed the NCBI BLAST tool in a web page and retrieve the output in various file formats [6]. In general, the output of a sequence alignment can be divided into four sections: (i) the header section, which consists of details of the BLAST program, query definition, database, and program parameters; (ii) the descriptive summary, which lists the matching hits, subject definition, bit score, and E-value; (iii) the sequence alignment, which reports the sequence length, bit score, E-value, identities, positives, and gaps; and (iv) the footer section, which gives the number of sequences searched, the length of the database, and the algorithm scores. The programming-language-specific BLAST outputs and their access methods are given in Table I below. Other output file formats, such as SAM, ASN.1 (Binary), and HTML, are not intended for parsing in application code and are instead read with a suitable viewer. The HTML output is the standard format for viewing in a web browser; it is similar to the Text format except for its clickable hyperlinks. The Text output can be viewed in any text editor that supports ASCII or Unicode.
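As an illustration of the header and footer sections, the following minimal sketch reads a few of their fields from an XML-formatted BLAST output using PHP's SimpleXML extension. It is not taken from the xmlBLASTparser source: the function name readHeaderAndFooter and the sample document are hypothetical, and the sample is heavily trimmed compared with a real BLAST XML file.

```php
<?php
// Hypothetical, heavily trimmed BLAST XML; real files carry many more fields.
$sample = <<<XML
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-def>example query</BlastOutput_query-def>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>10</Parameters_expect>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits/>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>1000</Statistics_db-num>
          <Statistics_db-len>250000</Statistics_db-len>
        </Statistics>
      </Iteration_stat>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
XML;

// Illustrative helper (not part of xmlBLASTparser): collect header and
// footer fields from the first iteration of a parsed BLAST XML document.
function readHeaderAndFooter(SimpleXMLElement $xml): array
{
    $iteration = $xml->BlastOutput_iterations->Iteration[0];
    $stats = $iteration->Iteration_stat->Statistics;

    return [
        'header' => [
            'program'  => (string) $xml->BlastOutput_program,
            'database' => (string) $xml->BlastOutput_db,
            // Hyphenated tag names need the brace syntax in SimpleXML.
            'query'    => (string) $xml->{'BlastOutput_query-def'},
            'expect'   => (string) $xml->BlastOutput_param->Parameters->Parameters_expect,
        ],
        'footer' => [
            'sequences' => (int) $stats->{'Statistics_db-num'},
            // Kept as a string: large databases can exceed 32-bit integers.
            'letters'   => (string) $stats->{'Statistics_db-len'},
        ],
    ];
}

$info = readHeaderAndFooter(simplexml_load_string($sample));
echo $info['header']['program'], ' against ', $info['header']['database'], "\n";
```

The descriptive summary and the alignment section are read in the same way from the <Hit> and <Hsp> elements of each iteration.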

III. RESULTS AND DISCUSSION
XML (eXtensible Markup Language) is a software- and hardware-independent, customizable markup language designed for storing and retrieving data; unlike HTML, its tag names are not predefined. In XML, tags are arranged in a hierarchical order similar to HTML, where tag names act as variables and the content between tags acts as values [7]. The XML output of an NCBI BLAST sequence alignment contains a major section, <Iteration>, which holds a brief description of the matching sequences, the HSP score parameters, and the sequence alignment of each hit [8] (Figure 1). The <Hit_id> and <Hit_def> tags are used to annotate each database accession number with a hyperlink to the original source; the sequence identifiers (gi, gb, pdb, etc.) are recognized using regular expressions. Similarly, the <Hsp> and <Hit_def> tags are used to generate the descriptive summary of sequence alignments, and the hit IDs are annotated with anchor hyperlinks to the corresponding sequence alignment section using regular expressions (Figure 2).
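The annotation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the xmlBLASTparser implementation: the function names linkAccession and summariseHits, the regular expression, the "#hitN" anchor scheme, the target URL pattern, and the sample hit (including its accession number) are all hypothetical.

```php
<?php
// Hypothetical sample: a trimmed BLAST XML document with a single hit.
$sample = <<<XML
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|12345|gb|ABC12345.1|</Hit_id>
          <Hit_def>example protein [Example organism]</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>250.4</Hsp_bit-score>
              <Hsp_evalue>1e-80</Hsp_evalue>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
XML;

// Turn a "db|accession" pair found in a hit ID into a hyperlink.
// The URL pattern is an assumption made for this sketch.
function linkAccession(string $hitId): string
{
    if (preg_match('/\b(gb|ref|emb|dbj|sp|pdb)\|([A-Za-z0-9_.]+)/', $hitId, $m)) {
        $url = 'https://www.ncbi.nlm.nih.gov/protein/' . rawurlencode($m[2]);
        return '<a href="' . $url . '">' . htmlspecialchars($m[2]) . '</a>';
    }
    // No recognised identifier: fall back to the escaped raw ID.
    return htmlspecialchars($hitId);
}

// Build one summary row per hit, using the first listed HSP for the scores.
function summariseHits(SimpleXMLElement $xml): array
{
    $rows = [];
    foreach ($xml->BlastOutput_iterations->Iteration as $iteration) {
        foreach ($iteration->Iteration_hits->Hit as $hit) {
            $hsp = $hit->Hit_hsps->Hsp[0];
            $rows[] = [
                'anchor' => '#hit' . (int) $hit->Hit_num, // in-page link to the alignment
                'link'   => linkAccession((string) $hit->Hit_id),
                'def'    => (string) $hit->Hit_def,
                'bits'   => (float) $hsp->{'Hsp_bit-score'},
                'evalue' => (string) $hsp->Hsp_evalue,
            ];
        }
    }
    return $rows;
}

foreach (summariseHits(simplexml_load_string($sample)) as $row) {
    printf(
        "<tr><td><a href=\"%s\">%s</a></td><td>%.1f</td><td>%s</td></tr>\n",
        $row['anchor'],
        htmlspecialchars($row['def']),
        $row['bits'],
        htmlspecialchars($row['evalue'])
    );
}
```

In this sketch the summary row links to an in-page anchor, while linkAccession produces the external link that would accompany the hit in its alignment section.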

A. Utility
A standardized XML output obtained from an NCBI BLAST sequence alignment is used as the input to xmlBLASTparser. The methods for retrieving the XML file are given below:
• Online tool: the simplest method is to download the XML file from the online NCBI BLAST tool after performing the sequence alignment. Alternatively, the XML file can be obtained through command-line execution of Remote BLAST or Local BLAST using the standalone NCBI BLAST+ tool. The PHP statement to read the XML file is $xml = simplexml_load_file("output.xml");
• RESTful service: this method is widely used by software developers to obtain the XML file online at the back end through any server-side programming language. Many O|B|F bioprogramming modules, such as BioPerl [6], BioRuby [7], BioJava [8], BioPython [9], and BioConductor [10], are used for similar functionality. BLASTphp [11]
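The two retrieval routes listed in this subsection could be sketched as follows. This is an illustrative sketch only: the helper names buildResultUrl and loadBlastXml are hypothetical, the request ID passed in the usage line is a made-up placeholder, and loading from a URL assumes that allow_url_fopen is enabled in the PHP configuration.

```php
<?php
// Build a result URL for the NCBI BLAST URL API. CMD=Get with
// FORMAT_TYPE=XML requests the XML report for a request ID (RID).
function buildResultUrl(string $rid): string
{
    $params = ['CMD' => 'Get', 'FORMAT_TYPE' => 'XML', 'RID' => $rid];
    return 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?' . http_build_query($params, '', '&');
}

// Load BLAST XML from a local path or a URL, raising an exception
// instead of emitting warnings when the document cannot be parsed.
function loadBlastXml(string $source): SimpleXMLElement
{
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_file($source);
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($xml === false) {
        $detail = $errors ? trim($errors[0]->message) : 'unknown error';
        throw new RuntimeException("Cannot parse BLAST XML from $source: $detail");
    }
    return $xml;
}

// Local route: a file written by standalone BLAST+ with XML output, e.g.
//   blastp -query query.fasta -db nr -outfmt 5 -out output.xml
// $xml = loadBlastXml('output.xml');

// RESTful route: fetch the report for a previously submitted search.
// $xml = loadBlastXml(buildResultUrl('HYPOTHETICALRID'));

echo buildResultUrl('HYPOTHETICALRID'), "\n";
```

Wrapping simplexml_load_file in a helper keeps the error handling in one place, so the rest of a page can assume it always receives a valid SimpleXMLElement.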

IV. CONCLUSION
xmlBLASTparser is a simple PHP script that consumes very little bandwidth and few resources on the web server. It can be easily integrated with any NCBI BLAST application to parse sequence alignment information from the XML output. The current version, xmlBLASTparser v1.1, generates rich, tabular web content with annotations and provides a brief summary of matching hits with detailed alignment scores. By combining the BLASTphp and xmlBLASTparser libraries in a PHP web form, one can build a sequence alignment tool analogous to Web NCBI BLAST. The sequence alignment generated by xmlBLASTparser is well formatted and identical to the NCBI BLAST sequence alignment result. xmlBLASTparser is still under development; we are currently working on CDS region prediction and a graphical descriptive summary of the sequence alignment from the XML output, using jQuery and CSS in addition to the xmlBLASTparser PHP library.