SEARCH FOR IN-SILICO APPLICATIONS IN DRUG DISCOVERY AND APPLICATIONS OF DIFFERENT DISCIPLINES IN IT: A SURVEY

The present paper surveys the different areas of new drug design that can be performed by means of in-silico methods. Diverse applications of subjects such as biology, chemistry, mathematics, statistics and physics in the different stages of the human drug design process are also reviewed from a computational point of view.


I. INTRODUCTION
Designing a drug for a specific disease is a process whose outcome is a new drug suitable for that disease. The overall drug design process is time consuming and very costly, and it requires a large amount of storage for biological data. It takes approximately ten to fifteen years for a drug to become available in the market. Modern drug development strategies try to reduce this time, and to make the process space and cost effective, by combining computational techniques with traditional methods. The application of computer science to drug development helps in addressing these issues.
The notion of drug development relates, in general, to the fields of biology and chemistry. However, research in drug design cannot be limited to these two fields. It extends to other fields such as mathematics, statistics, physics and, above all, computer science. When information systems are combined with these fields, the resulting disciplines are known as bioinformatics, cheminformatics, pharmacology, etc. Computer-aided drug development brings together scientists from many subject areas to work collaboratively.
The overall drug development process is composed of several stages. Modern drug development process starts with identification of drug targets followed by validation of these targets, discovery of lead drugs and optimization of lead drugs. After this, optimized lead drugs go for preclinical and clinical testing. Finally, the new drugs are brought to the market.
The aforesaid fields of study can be applied to each of these stages of drug design to speed up the process. The present paper discusses, from the computational side only, the contribution of these fields to every stage of the design of a new drug. It also highlights the areas of the human drug design process where in-silico methods can be applied to make the whole development process time and cost effective.
The rest of the paper is arranged as follows. Section II defines the basic biological terminology. Section III presents a detailed discussion of the computer-aided drug design (CADD) process. The computational tasks to be performed in the CADD process are given in Section IV. Contributions of different disciplines to the CADD process are discussed in Section V. The conclusion is drawn in Section VI, followed by the references.

II. BASIC BIOLOGICAL TERMINOLOGIES
Definitions and descriptions of some of the terms related to the context of this survey paper are given in this section.
Chromosome: Genetic information is packed into a thread-like structure in the nucleus of each cell of a living organism. This structure is known as a chromosome. The major components of each chromosome are nucleic acids and proteins.
Nucleic Acid: Genetic information is transferred from one generation to the next through nucleic acids. There are two types of nucleic acids, namely deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). Nucleic acids are composed of nucleotides, each of which consists of a nitrogenous base, a phosphate group and a 5-carbon sugar. Of the four nitrogenous bases in DNA, adenine (A) and guanine (G) are purines with two fused rings, while cytosine (C) and thymine (T) are pyrimidines with a single ring. RNA has the same bases as DNA except that thymine is replaced by uracil (U).
DNA has two strands, written 5'-3' and 3'-5', which run in opposite directions and are held together by hydrogen bonds between paired purines and pyrimidines. According to the base pairing rule, an A and a G on one strand are always paired with a T and a C, respectively, on the other strand. Base pairing between a DNA strand and an RNA strand follows the same rule; the only difference is that an A on the DNA strand is paired with a U on the RNA strand. Each strand of DNA thus forms a character string over A, T, G and C, and the order of these bases in the string is known as the DNA sequence. Similarly, an RNA sequence is a series of A, U, G and C. These two sequences are responsible for converting genetic information into proteins. Fig. 1 shows parts of two strands of a DNA sequence and one strand of an RNA sequence.
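The base pairing rule can be sketched in a few lines of Python. The function names `complement_strand` and `transcribe` are illustrative only and do not come from any library; the sketch assumes that sequences are plain uppercase strings with no ambiguous bases.

```python
# Illustrative sketch of the base pairing rule; all names here are hypothetical.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}
DNA_TO_RNA_PAIRS = {"A": "U", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand_5_to_3: str) -> str:
    """Return the partner DNA strand, also written 5'-3'.

    The two strands run in opposite directions, so the paired bases
    are emitted in reverse order.
    """
    return "".join(DNA_PAIRS[base] for base in reversed(strand_5_to_3))


def transcribe(template_3_to_5: str) -> str:
    """Pair each base of a DNA template (read 3'-5') with an RNA base.

    The resulting RNA string is read 5'-3', so no reversal is needed.
    """
    return "".join(DNA_TO_RNA_PAIRS[base] for base in template_3_to_5)


if __name__ == "__main__":
    print(complement_strand("ATGC"))  # GCAT
    print(transcribe("TACG"))         # AUGC
```

Keeping the two pairing tables separate makes the single difference between DNA-DNA and DNA-RNA pairing (A pairs with U instead of T) explicit.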
Protein: A protein is a sequence built from twenty types of amino acids. Information is transferred from DNA to protein in two steps, namely transcription and translation. In the first step, the information stored in DNA is transferred to messenger RNA through complementary base pairing between the bases of the DNA and the transcribed RNA. In the second step, this information is translated into a protein. Each triplet of bases, known as a codon, is converted into an amino acid with the help of a code table (the genetic code). The procedure for synthesis, modification and regulation of proteins is known as protein expression.
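As a minimal sketch of the translation step, the code table can be held in a dictionary and applied three bases at a time. The sketch assumes the standard genetic code and assumes that reading starts at the first base of the input; it does not search for a start codon. The function name `translate` is illustrative.

```python
# Standard genetic code in compact form: codons are enumerated with the
# bases in the order T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at a stop codon.

    Assumption: the reading frame starts at index 0. Trailing bases that
    do not form a full codon are ignored.
    """
    coding = mrna.upper().replace("U", "T")
    protein = []
    for start in range(0, len(coding) - 2, 3):
        amino_acid = CODON_TABLE[coding[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUGA"))  # MA
```

The output uses one-letter amino acid codes (M for methionine, A for alanine).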
Gene: In each chromosome, the genetic information lies in the form of genes, which are made up of DNA. The overall procedure of transcription and translation of genes into proteins is known as gene expression. Not all parts of a gene are translated into proteins: protein-coding and non-coding parts are interspersed throughout the sequence. The number of bases in a gene varies from a few hundred to a few million. The number of genes in humans is also large.
Gene Mapping: Gene maps provide information about how genes are arranged on a chromosome. The specific location of a gene on a chromosome is called its locus, and gene mapping is the technique used to find it. Loci are used to find the distances between genes on a chromosome. Gene mapping also helps in predicting how the distinguishing features of a living organism are inherited, which in turn helps in understanding disease-related characteristics.

Gene Expression Profile: A gene expression profile is obtained from a test that identifies all the genes in a cell that are producing messenger RNA, which carries the genetic information to be translated into proteins. Analysis of this profile has many therapeutic uses. It is used for diagnosis of a disease and also helps in checking the response of the body to a treatment.
Protein Structure: There are four levels in protein structure which are primary, secondary, tertiary and quaternary.
A protein sequence has a chain-like structure (polypeptide chain) in which many amino acids (of 20 types) are joined together. A water molecule is lost each time two amino acids are joined, so a protein sequence is actually made of amino acid residues. A protein sequence typically contains more than 50 amino acid residues. The two ends of the polypeptide chain are known as the N-terminus or amino terminus (on the left end) and the C-terminus or carboxyl terminus (on the right end). This structure is considered to be the primary structure of the protein, in which amino acids are joined together by peptide bonds. Fig. 2 shows a fragment of a primary protein structure, using three-letter abbreviations for amino acid residues such as Cys for Cysteine, Ala for Alanine and Glu for Glutamate.
Secondary protein structures are of two main types: α-helix and β-sheet. They are defined by hydrogen bonds between the amide hydrogen (N-H) of one residue and the carbonyl oxygen (C=O) of another residue on the main peptide chain.
The three-dimensional structures of a protein are its tertiary and quaternary structures. To carry out their biological function, proteins need to be in their three-dimensional shape. Protein folding is the process that brings a protein into its final three-dimensional functional shape. All the levels of protein structure are essential for protein folding.
Methods such as X-ray crystallography and nuclear magnetic resonance (NMR) are used to determine the structure of proteins.
Knowledge of protein structure is required for the development of drug compounds.
Sequence Alignment: It is a technique for finding similar sequences, which helps in identifying sequences with similar functionality or structure. Sequences are compared base by base in the case of nucleic acids and residue by residue in the case of proteins. An alignment may match only the most similar regions of two sequences (local alignment) or span their entire length (global alignment). When part of a sequence is missing, sequence alignment can recover likely structural information by finding sequences similar to it. This is beneficial for the treatment of diseases.
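The difference between global and local alignment can be illustrated with a small dynamic-programming sketch. The survey does not name a specific algorithm; the code below follows the well-known Needleman-Wunsch (global) and Smith-Waterman (local) recurrences and returns only the alignment score. The scoring values (+1 match, -1 mismatch, -1 gap) and the function name `align_score` are assumptions chosen for illustration.

```python
def align_score(a: str, b: str, local: bool = False,
                match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the best alignment score of sequences a and b.

    local=False: global alignment over the full length of both sequences.
    local=True:  local alignment; the score of the best-matching region.
    """
    cols = len(b) + 1
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    print(align_score("ACGT", "AGT"))                                  # 2
    print(align_score("TTTGATTACATTT", "CCGATTACACC", local=True))     # 7
```

In the second call the shared region GATTACA scores 7 locally, whereas a global alignment of the same pair is penalized for the unrelated flanking bases.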
Sequence Analysis: It is a process for understanding the features, structural information, functionality and evolutionary characteristics of nucleic acid and protein sequences. Its tasks include determining the order of nucleotides or amino acid residues in a DNA, RNA or protein sequence, searching biological databases and aligning sequences. It is also helpful for assembling a DNA sequence by aligning and joining a number of DNA fragments. Sequence analysis has many applications, one of which is the treatment of genetic diseases.

III. COMPUTER AIDED DRUG DESIGN (CADD)
Drug design is an innovative process that finds new medicines or drugs for diseases. It uses knowledge of the biological target in designing new drugs. Extraction of drugs from plants is the traditional approach. Modern drug discovery methods consider a drug to be a chemical or biological substance that has medicinal uses.
The traditional drug design techniques are based on the study of molecular biology [1], system biology [2] [3] and cell biology [4], and all of these techniques involve time-consuming and expensive physical experiments. Modern drug discovery methods have replaced many of these physical experiments with computational search, and as a result the cost and time of the overall process have decreased.
Computerization has made many contributions to the modern drug design field [5] [6]. While developing new drugs, the development process needs to handle a huge volume of raw biological data. Computational approaches also help to store and maintain these data in a manageable form.

A. Biological Target (Drug Target)
Biological targets, also referred to as drug targets, are molecules in the host organism on which pathogens depend for survival. Any disturbance in the functioning of these molecules destroys the survival environment of the pathogens. A target molecule may be a receptor, an enzyme, another protein or a nucleic acid, and it is responsible for the progression of the disease; acting on it may produce desirable therapeutic effects or unwanted harmful effects. Drugs are designed to change the behavior of the target molecule by binding to it, so that the pathogens can no longer survive.

B. Different Types of CADD Process
The CADD process comes in two forms: structure based, or direct, drug design and ligand based, or indirect, drug design [7]. When the three-dimensional structure of the drug target is available, the structure based design process is used; otherwise, the ligand based design process is used.
1. Structure Based Drug Design Process: This process is decomposed into a number of stages [8]. The two major stages prior to any pre-clinical and clinical tests are target selection and lead compound (potential drug) selection. The stages of the structure based drug design process are shown in Fig. 3.
Target Selection: This stage selects a probable drug target for a specific disease. It consists of two steps, target identification and target validation. A number of in-silico methods are available to perform these two steps [9].
Target Identification: In this step, the molecular targets that cause the disease to progress are identified. The step involves many computations on biological sequences, such as query processing, sequence alignment and sequence analysis. It also selects disease-related genes, analyzes genes related to drug action, screens genes for toxic side effects, performs functional prediction, annotates and prioritizes genes and proteins, collects structural information and gene and protein expression data, compares two or more sequences, analyzes gene expression profiles, maps information and differentiates between healthy and diseased cells. X-ray crystallography, nuclear magnetic resonance (NMR), homology modeling and protein folding methods are used in target identification to determine the three-dimensional structures of proteins as well as their binding or active sites. Homology modeling is an iterative technique that generates a model of the three-dimensional structure of a target protein based on the similarity of its sequence to proteins of known structure.
Target Validation: In the second step, the identified targets are verified for their therapeutic advantages for patients, i.e. they are tested to see whether they are capable of producing the desired clinical results, and are then selected as drug targets. This is a refinement, or reduction, step: not all identified targets are selected as drug targets, only a fraction of them chosen according to their priority. Some of the computational tasks performed in this step are mapping of genetic networks, analysis of protein-protein interactions and prediction of sub-cellular localization.
Lead Compound Selection: The selection of a lead compound is again a two-step process, consisting of lead identification and lead optimization.
Lead Identification: In this step, a medicinally beneficial chemical compound that shows biological or pharmacological activity towards a drug target is identified. Techniques such as protein crystallography, nuclear magnetic resonance (NMR), de novo design and virtual screening (the computerized search of structural databases, for example to study the pharmacophore of an existing drug) help in identifying suitable lead compounds. Virtual screening uses computer programs to score, prioritize and filter a large number of structures. De novo design builds new molecules based on the three-dimensional structure of a target.
Lead Optimization: In this step, the identified lead compounds are tested for their effectiveness, toxicity and absorption with respect to a disease, corrective steps are taken accordingly, and a potential drug is then selected. Molecular docking, a computational technique, is used to determine how a lead compound will bind to the active site of a target protein [7]. It tests how two molecular structures, one for the lead compound and the other for the target protein, fit together, i.e. the protein-drug interaction.
2. Ligand Based Drug Design Process: This process uses knowledge of the structural features of a small molecule that are responsible for its biological or pharmacological activity when it binds to the drug target; these features are known as the pharmacophore of the molecule. The small molecule is termed a ligand. The process is useful for deriving a model (pharmacophore model) for the drug target, when the structural properties of the target are missing, from the structural information of the molecule that binds to it. After the drug target has been selected, ligand based drug design starts with the identification of the pharmacophore of a ligand. Next, based on this pharmacophore, the molecular structure of the ligand is modified iteratively until it best fits the biological target and can be treated as a potential drug. Pre-clinical and clinical tests are then performed on this potential drug compound. Fig. 4 shows the steps of the ligand based drug design process between drug target selection and pre-clinical testing.
3D-QSAR (three-dimensional quantitative structure-activity relationships) is a computational method used in ligand based drug design. It studies the quantitative relationship between the favourable or unfavourable effects of a group of compounds and their three-dimensional features, which helps in assessing whether a new chemical compound can be treated as a drug.
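The idea of a quantitative structure-activity relationship can be sketched, in a heavily simplified form, as a regression of activity on a molecular descriptor. Real 3D-QSAR methods use many three-dimensional field descriptors and multivariate statistics; the sketch below assumes a single numeric descriptor per compound and ordinary least squares, and the function names and numbers are purely hypothetical.

```python
def fit_line(descriptor, activity):
    """Ordinary least squares fit of: activity = slope * descriptor + intercept."""
    n = len(descriptor)
    mean_x = sum(descriptor) / n
    mean_y = sum(activity) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, activity))
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, descriptor_value):
    """Predict the activity of a new compound from its descriptor value."""
    slope, intercept = model
    return slope * descriptor_value + intercept


if __name__ == "__main__":
    # Hypothetical training data: one descriptor value and one measured
    # activity per compound (illustrative numbers, not real measurements).
    model = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
    print(predict(model, 5.0))
```

Once fitted on compounds with known activity, such a model is used to estimate the activity of compounds that have not yet been tested.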
Another method used in ligand based drug design relies on the structural and physical similarities between a ligand and known drugs, in the expectation that the ligand will have binding properties similar to those of the known drugs.
After a potential drug has been selected, it undergoes pre-clinical (animal) and clinical (human) testing, followed by prediction of drug-drug interactions. The results of interaction between two or more drugs are required when they are administered together, because one drug may influence the activity of another to a great extent. After positive results have been obtained from these steps, the final drug product is obtained and brought to market.
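One common way to quantify structural similarity, used here only as an illustration since the survey does not name a specific measure, is the Tanimoto coefficient between molecular fingerprints. The sketch assumes each molecule has already been encoded as a set of "on" bit positions; the fingerprints, compound names and function names are hypothetical.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def rank_by_similarity(query: set, library: dict) -> list:
    """Return (name, similarity) pairs for a name -> fingerprint library, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    known_drug = {1, 2, 3}                    # hypothetical fingerprint
    candidates = {"ligand_x": {2, 3, 4},      # hypothetical candidates
                  "ligand_y": {7, 8}}
    print(rank_by_similarity(known_drug, candidates))
```

Candidates at the top of the ranking are the ones expected, under the similarity assumption, to bind most like the known drug.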

C. ADMET Properties of Drugs
Prediction of the absorption, distribution, metabolism, excretion and toxicity (ADMET) properties of a drug is essential in the drug development process. Determining the optimal ADMET properties of potential drugs during pre-clinical testing allows effort to be concentrated on a limited number of them and improves their chances of success in clinical testing.
Absorption: Absorption describes how a drug enters the bloodstream after it is taken into the body.
Distribution: Distribution describes the movement of a drug from organ to organ through the blood.
Metabolism: Metabolism is the alteration of the chemical structure of a drug inside the body.
Excretion: Excretion is the removal of a drug from the body.
Toxicity: Toxicity refers to the poisonous effects of a drug.
These properties are required to determine the proper dose and timing of a drug. Different in-silico tools for ADMET prediction are available [10] [11].

D. Drug Repurposing
Knowledge of existing drugs is also taken into consideration while designing new drugs. Among the existing drugs, some may have more side effects than others, so they may not be used in the treatment of their intended diseases but may be safe and suitable for the treatment of other diseases. Drug repurposing is a technique in which existing drugs are studied thoroughly in order to design drugs for new diseases. It searches among the existing drugs for those that can be reused with slight alterations in their structures, doses and timings.

E. Biomarker
A biomarker (biological marker) is a measurable feature of a biological molecule that is present in blood, in other body fluids or in tissues and that indicates the normal or diseased condition of the body. It can be a genetic or biochemical characteristic or any other substance that helps in the identification of a disease. Biomarkers are used to observe the response of the body to a certain treatment and to evaluate normal or pathogenic processes. They are used in all of the above mentioned phases of the CADD process and have many clinical uses.

IV. COMPUTATIONAL TASKS IN CADD PROCESS
A. Database Management in CADD
Computerization of biological data has resulted in the creation of various biological and chemical databases. The computer-aided drug design process also needs to handle biomedical data and drug data [12]. Biomedical data are received from pharmaceutical companies, hospitals, nursing homes and clinical laboratories in large volume and with high dimensionality. These data may also be received from any public network. Drug data may include sequence data, gene expression data, protein-drug, protein-protein or drug-drug interaction data, and data of other types such as patient records, either in electronic form or in the form of reports. Big data analytics is therefore needed for the systematic management of these data [12].
A number of software packages, tools and databases are available to facilitate the drug development process. Database preparation is a fundamental task in all stages of the drug design process. Bioinformatics and cheminformatics have made it possible to create databases for storing structural information and biological sequences of different organisms as well as biomedical or drug data. The overall process needs to handle different types of biological and chemical databases for genomes, proteins, amino acids or nucleic acids, and databases for storing annotation, sequence, structural, functional or other types of information. Information obtained from pre-clinical and clinical studies of potential drugs is also made available later through the creation of corresponding databases. Some of these databases are GenBank, EMBL and GEO (Gene Expression Omnibus). Such databases can be accessed through the server of the National Center for Biotechnology Information (NCBI) [13]. A web server that performs biological sequence alignment has been developed by Ying Liu et al. [14]. A list of some public domain databases in the medicinal field can be found in [15]. The drug design process starts after a disease has been identified by searching disease databases. KEGG [16] and MalaCards [17] are two databases that store information related to human diseases. Bioinformatics and cheminformatics tools are available to create different medicinal databases. SWEETLEAD is a cheminformatics database [18], whereas ChEMBL is a bioinformatics database [19] used for drug design. Online databases such as BindingDB and ChEMBL [20] are also available for the same purpose. The creation of more advanced databases and web servers using bioinformatics or cheminformatics tools has become an important research area in the field of drug discovery.

B. Use of Database Searching Tools in CADD
With the development of databases, there is also a need for database searching tools. BLAST is one such tool [21]. BLAST finds local alignments between sequences. There are different types of BLAST tools, such as Nucleotide BLAST and Protein BLAST. Some of the research on constructing robust searching tools for different biological databases can be found in [22] - [25].
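Database search tools of this kind typically avoid aligning the query against every database sequence in full by first locating short shared words ("seeds") and only then extending them into local alignments. The sketch below illustrates just the seeding idea with exact word matches; it is a simplified illustration, not the BLAST implementation, and the function names and default word length are assumptions.

```python
from collections import defaultdict


def build_kmer_index(subject: str, k: int) -> dict:
    """Map every length-k word of the subject sequence to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = build_kmer_index(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, s_pos))
    return seeds


if __name__ == "__main__":
    print(find_seeds("GATT", "TTGATTACA"))  # [(0, 2), (1, 3)]
```

Seeds that lie on the same diagonal (equal difference between subject and query position), as in the example, are natural candidates for extension into a longer local alignment.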

C. Big Data Analytics in CADD
The big data problem associated with these databases has also drawn the attention of researchers, and the remedy is again the use of bioinformatics and cheminformatics algorithms. Some existing solutions to this problem are given in [26] - [29]; these works use machine learning, artificial intelligence, pattern matching and the Hadoop software to handle big data. Big data also helps in selecting drug targets [30] and in virtual screening [31]. A method for identifying drug target paths in biomedical data is given in [32].
The computerization of the drug design process is not restricted to the storage and maintenance of biological data. It covers any task that needs computation. It applies to the diverse functions of the different phases of the drug design process described in Section III-B, as well as to the prediction of ADMET properties of drugs, the identification of biomarkers and drug repurposing. Fig. 5 depicts the different in-silico tasks of the CADD process.

V. CONTRIBUTION OF DIFFERENT DISCIPLINES IN CADD
Each and every step of the drug development process opens a new door to the research zone in computer aided drug design. The subject areas covered by this process mainly include bioinformatics [33], cheminformatics [34], pharmacokinetics [35] and pharmacodynamics [36].
Bioinformatics is a field of study that analyzes biological data using mathematics, statistics and computer science [37], whereas cheminformatics is the field of study that uses computers and information systems to solve chemical problems [38]. Bioinformatics concentrates on the collection, storage, inspection and management of biological data, whereas cheminformatics does the same for chemical data. Bioinformatics is used to select drug targets and helps in screening candidate drugs; it also helps in determining the side effects of a drug and in predicting drug resistance [33]. Structural bioinformatics, a branch of bioinformatics, also makes a large contribution to drug discovery. It helps in the analysis and prediction of the three-dimensional structures of proteins and nucleic acids. The work of D. K. Brown and O. T. Bishop discusses the role of structural bioinformatics in drug discovery using computational SNP analysis [39]. The use of cheminformatics tools to select lead compounds is discussed in [34].
Both bioinformatics and cheminformatics need to perform pattern recognition and data mining on clinical data throughout the entire CADD process. Different machine learning approaches are used to identify drug targets [40]. The performance of machine learning techniques applied to the protein folding problem is measured in [41]. Algorithms for ligand based virtual screening using machine learning approaches are discussed in [42]. Clustering techniques are also applied to chemical data for the purpose of analysis [43]. Graph theoretic approaches are applicable to drug design as well; they have been applied in ligand based drug design [44] and in the structural analysis of protein active sites [45].
Sometimes bioinformatics techniques are integrated with cheminformatics techniques, an approach known as biochemoinformatics, to design more robust drugs [46].
Computational biology has many applications in drug discovery. Bioinformatics tools have made it easy to use the features of computational biology, as discussed in [47]. Besides computational biology, other subjects such as biophysics, sociology and biotechnology have many applications in computational drug design, which are discussed later in this section.
Pharmacokinetics is the study of the ADME behavior of a drug over time, i.e. what the body does to a drug. It helps in deciding the proper dose and safety of a drug [48]. Pharmacodynamics, on the other hand, is the study of the effects of a drug on its targets, i.e. what the drug does to the body, which depends on the dose and timing of the drug. Pharmacology is the study of drugs that combines the areas of pharmacokinetics and pharmacodynamics. Pharmacokinetic studies were traditionally performed at the end of the drug development process. More recently, ADME properties are predicted at the very beginning of the process in order to eliminate molecules with poor ADME properties. Pharmacological studies are performed in target and ligand screening, drug repurposing and clinical testing of a drug. Some of the computer based methods for these uses of pharmacology can be found in [49] - [52].
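A standard textbook simplification, given here only as an illustration and not taken from the surveyed works, shows how pharmacokinetics relates dose timing to drug concentration: a one-compartment model with first-order elimination after a single intravenous dose. The function and parameter names and the numerical values are assumptions.

```python
import math


def concentration(c0: float, k_elim: float, t: float) -> float:
    """Concentration at time t under first-order elimination: C(t) = C0 * exp(-k * t)."""
    return c0 * math.exp(-k_elim * t)


def half_life(k_elim: float) -> float:
    """Time for the concentration to fall by half: t_half = ln(2) / k."""
    return math.log(2) / k_elim


if __name__ == "__main__":
    c0 = 10.0      # hypothetical initial concentration (mg/L)
    k_elim = 0.1   # hypothetical elimination rate constant (1/h)
    t_half = half_life(k_elim)
    print(round(t_half, 2))                             # 6.93
    print(round(concentration(c0, k_elim, t_half), 2))  # 5.0
```

Under this model the half-life depends only on the elimination rate constant, which is why it is commonly used when choosing the interval between doses.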
Omics studies such as genomics, proteomics, transcriptomics and metabolomics play significant roles in the above mentioned stages of drug development as well as in pre-clinical and clinical testing [53]. They are also used in drug repurposing and in the identification of biomarkers. Each of these fields is an individual research area. The demand for computational omics studies, either structure based or function based, in drug design is increasing rapidly. Several bioinformatics and cheminformatics tools are available to assist with the different functions of computational omics studies. Some of the research in this area is presented in [54] - [65]. The method given in [54] helps in predicting drug-protein interactions, and the methods presented in [55] use data mining techniques to analyze protein-protein interaction data collected from biological studies. In [56], computational methods are used in structural genomics, and in [57], methods of computational proteomics and lipidomics for drug design are reviewed. Computational methods for predicting protein structure for use in the drug design process are discussed in [58]. Different aspects of proteomics in the CADD process are discussed in [59] [60]. The work in [61] shows the use of transcriptomics in the identification of a lead compound. Applications of metabolomics in the drug design process are shown in [62] [63]. The application of machine learning to metabolomics and to identifying drug-drug interactions is shown in [64]. Omics data mining also has an application in drug repurposing [65].
Another field of study, known as pharmacogenomics, has also become effective in selecting the optimal drug, its dose and its treatment process and in assessing its side effects [66]. It is the combined study of pharmacology and genomics, i.e. it concerns the part that a genome plays in the response to a drug. The tasks of computational pharmacogenomics are outlined in a chapter of the book [67].
Biophysics also makes a great contribution to developing new drugs. Biophysical technologies such as X-ray crystallography, nuclear magnetic resonance spectroscopy and surface plasmon resonance spectroscopy are considered to be key components of drug discovery [68]. Biophysics is also involved in automated drug design through automated prediction of the structure and annotation of proteins [69]. A number of research works have already been done in this area. Some of them, given in [70] - [72], are capable of predicting protein-protein interactions and drug-target interactions.
Sociological studies have several advantages in drug development [73]. Biotechnological and genetic engineering methods also play an important role in the pharmaceutical industry, which is responsible for the final marketing of newly discovered drugs along with various other tasks. The research described in [74] and [75] shows these features. Prediction of protein structure and of protein sub-cellular localization can be performed very well using bioinformatics tools, and these are also major tasks in biotechnology. Bioinformatics tools are therefore capable of performing biotechnological functions and, in turn, functions of pharmaceutical science.
From the above discussion it is clear that drug discovery does not constitute a single discipline; rather, it is a combination of a number of disciplines: biology and biophysics (used to identify biological targets); pharmacology and chemistry (used for prioritization or validation of drug targets and for selecting lead compounds); chemistry, pharmacology and toxicology (used for pre-clinical testing); pharmaceutical science (used to produce the final medicinal product); and medicine (used for clinical testing of a drug). These disciplines are assisted by others that perform computational tasks, such as mathematics, statistics, computer science and information systems. Fig. 6 illustrates this.

VI. CONCLUSION
In addition to the contributions of bioinformatics, cheminformatics, pharmacology, biophysics and sociology to the drug development process, other fields such as biotechnology, genetic engineering, the medicinal industry and the pharmaceutical industry have adopted these disciplines for the discovery of new drugs. That is, no single subject can claim sole ownership of the discovery of new drugs. It is the product of the combined effect of all of these subjects, and this makes the field a challenging research area.