ANDROID MALWARE DETECTION USING HAML

Abstract: Malware is now a pervasive threat that harms users' devices, and its volume keeps growing. Detection is challenging because whenever a new detection technique is devised, attackers develop ways to evade it. At present there are two techniques for deciding whether an application is malware or goodware: static analysis and dynamic analysis. Most anti-virus software uses signature-based detection, which is inefficient today because of the rapid increase in the number and variants of malware. A signature is a unique identifier for a binary file, created by analyzing the file with static analysis methods. Dynamic analysis uses the actions and behavior of an executable at runtime to determine its type (malware or benign). Both methods have their own benefits and drawbacks. This paper proposes a new technique, HAML (Hybrid Analysis with Machine Learning). Hybrid analysis combines static and dynamic analysis to analyze an executable file, and machine learning is used to classify an unknown executable file. Known malware and benign programs are used as training data, and the feature vector is selected by analyzing the binary code and the dynamic behavior. The proposed method exploits the benefits of both static and dynamic analysis, thereby improving both the efficiency and the classification result. Our experimental results show an accuracy of 95.87% using the static method, 97.17% using the dynamic method and 98.72% using the embedded method. Compared with the standalone static and dynamic methods, the HAML method gives more accurate results and proves to be more efficient.


INTRODUCTION
The Internet has become an important part of people's everyday life as online payments and online banking grow in popularity. Internet users face security threats from malicious software. Such software, known as malware, is a program developed specifically to harm the user's device or data, for example by stealing private data, without notifying the user. Depending on its behavior and the way it infects a system, malware is classified as spyware, worms, rootkits, viruses, Trojan horses, etc. Thousands of new malware samples are developed every day, and existing malware also changes its structure, so detection becomes very difficult. Due to the vast amount of new samples emerging every day, security specialists and antivirus vendors depend on automated malware analysis tools and methods in order to distinguish malicious from benign code [1]. Most antivirus products use a signature-based malware classification method [1,2,3]. In this method, malware is identified by comparing unknown programs with the known malware programs present in a database. A signature is a unique label assigned to a binary file; it can also be regarded as a unique identifier. The signature may be created using static, dynamic or hybrid methods and is stored in a signature database. Because new malware is created each day, the signature-based detection approach requires frequent updates of the virus signature database, which is its main disadvantage. Static analysis extracts features from the binary code of programs and uses them to create models.
These models are then used to classify a program as malware or legitimate software. Static analysis fails against the various code obfuscation techniques [4] used by virus writers, and against polymorphic and metamorphic malware [5]. Its advantage is that the binary code contains very useful information about the malicious behavior of a program, in the form of opcode sequences and functions with their parameters. Code obfuscation techniques and polymorphic malware, on the other hand, are ineffective against dynamic analysis [6], because it analyzes the runtime behavior of a program by monitoring the program during execution, and runtime behavior is hard to obfuscate [7,8]. Dynamic analysis has some limitations, however. Each malware sample must be executed within a secure environment for a specific time so that its behavior can be monitored. The monitoring process is time-consuming, and it must ensure that the executing malware cannot infect the platform [9]. The secure environment is quite different from a real runtime environment, and the malware may behave differently in the two environments, resulting in an inexact behavior log [3]. In addition, some malware actions that are triggered only under certain conditions (the system date and time, or some particular user input) may not be detected in the secure virtual environment [2]. Nevertheless, dynamic analysis is a necessary complement to the static approach because it is resilient to code obfuscation. Both static and dynamic methods have their own advantages and disadvantages, so a combined method that utilizes both static and dynamic features is promising for malware classification. The proposed method uses both static and dynamic features of malware and, by applying machine learning techniques, provides efficient automated classification of malware.

Embedded static and dynamic method
Most work in malware classification uses either static analysis or dynamic analysis. Our proposed method, in contrast, combines the positive aspects of both. We collected malware executables from the VirusShare [10] community website and took the static features from their binary code: the printable string information (PSI) extracted from each binary is used as the static feature. The Cuckoo sandbox [11] is used to perform dynamic analysis, which focuses mainly on system call sequences. Combining the features extracted from the binary code with the behavior of the file during execution may be adequate for a better classification result. The proposed method uses machine learning for automated classification and detection.

Architecture of the proposed method
The architecture of the proposed method is shown in Figure 2. Static and dynamic analysis are performed on a dataset containing both malicious and benign files. Static analysis is done by extracting the PSI features, and dynamic analysis is done by extracting the API call sequence. The method is explained in the following sections.

Static analysis and Static features
Feature extraction is the major part of any classification task. The static features are extracted from the binary files and given as input to various classification algorithms. In this work, printable string information (PSI) extracted from the binary files is used as the static feature. Printable strings are the unencoded strings present in a binary executable file. Several studies show that PSI is one of the best features that can be extracted from a binary executable [2,12]. Code obfuscation techniques may insert many unwanted strings into the binary files, so not all the PSI extracted from a binary is significant or used in the classification. The extracted PSI is therefore processed so that the output contains only strings that are meaningful for classification. The extracted strings are sorted by their frequency of occurrence within a file, and strings with a frequency below a particular threshold are eliminated. A global list of PSI, called the feature list, is then created; it contains all the strings selected from every executable file in the dataset, both malware and benign. Each entry in the feature list is a feature. Each malware and benign file is compared with the list and represented by a binary vector recording, as a true/false value, whether the file contains each string.
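The selection and encoding steps described above can be sketched in Python. This is a minimal illustration, not the paper's Algorithm 1: the function names, the sample string lists, and the choice to count a string's total occurrences across all files are assumptions made for the example.

```python
from collections import Counter

def select_features(files, threshold):
    """Keep strings whose total count across all files is >= threshold.

    Assumption: frequency is pooled over the whole dataset; the paper's
    exact counting rule may differ.
    """
    counts = Counter()
    for strings in files.values():
        counts.update(strings)
    return sorted(s for s, c in counts.items() if c >= threshold)

def to_binary_vector(strings, feature_list):
    """Encode one file as 1/0 flags, one per entry of the feature list."""
    present = set(strings)
    return [1 if feature in present else 0 for feature in feature_list]

# Hypothetical printable strings per file (invented for illustration).
files = {
    "File1": ["FindFirstFile", "GetLastError", "CreateMutex"],
    "File2": ["FindFirstFile", "GetLongPathName", "GetLastError"],
    "File3": ["GetLongPathName", "RegOpenKey"],
}

feature_list = select_features(files, threshold=2)
vectors = {name: to_binary_vector(s, feature_list) for name, s in files.items()}
```

With a threshold of 2, only strings that occur at least twice survive, and every file is then encoded against the same shared list, so all vectors have equal length.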
Algorithm 1 shows the process of static feature vector creation. The following example clarifies the static feature extraction and feature selection process. Consider three files corresponding to three binary files after extraction and processing: The frequency file created from these files will look as follows: If the threshold is set to 2, the selected features will be FindFirstFile, GetLongPathName, and GetLastError. The feature vector for File1 will then be as follows:

Dynamic analysis and Dynamic features
APIs are provided by the operating system so that application programs can access low-level hardware through system calls. Attackers use the same set of APIs to carry out malicious activities, so the presence or absence of an API in the log is not enough to predict whether a given file is malware. In our work, we therefore consider the API call sequence. The similarity in call sequence between files of the same class must be greater than the similarity between files of different classes. We use an n-gram based method to analyze the call sequence; the n-grams are called API-call-grams. As the size of the n-gram increases, the number of n-grams shared by two files, even within the same class, becomes very small. On the other hand, analysis based on unigrams is the same as checking whether an API is present in a file. So in our work, we consider only 3-API-call-grams and 4-API-call-grams.
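To make the notion of an API-call-gram concrete, the following Python sketch slides a window of size n over a call sequence. The helper name and the sample call log are hypothetical and are not taken from the paper's data.

```python
def api_call_grams(calls, n):
    """Return every run of n consecutive API calls as a tuple."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

# Hypothetical API call log of one executable (invented for illustration).
call_log = ["CreateFile", "WriteFile", "CloseHandle", "RegOpenKey", "RegSetValue"]

grams3 = api_call_grams(call_log, 3)   # 3-API-call-grams
grams4 = api_call_grams(call_log, 4)   # 4-API-call-grams
unigrams = api_call_grams(call_log, 1) # order-free: same as a presence check
```

A log of five calls yields three 3-grams and two 4-grams, while the unigrams carry no ordering information at all, which is why they add nothing beyond a presence check.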
The feature vector is created as shown in the table. The sets of 3- and 4-API-call-grams are generated for each file from the processed call sequence log. For each file, the n-gram set is sorted, and the grams below a threshold are eliminated. A table is created for both kinds of API-call-grams (3-API-call-grams and 4-API-call-grams), whose entries are the binary files in the dataset and the corresponding API-call-grams from their n-gram sets. The table thus contains a global list of API-call-grams, which is in turn sorted by frequency, and API-call-grams with low frequency are eliminated. The selected API-call-grams constitute the features. Algorithm 2 shows the dynamic feature extraction process. A sample feature vector created by the algorithm is shown in Table 3.
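The two-stage filtering described above (per file, then across the dataset) might look like the following Python sketch. It is not the paper's Algorithm 2: the threshold values, the sample logs, the function names, and the use of binary presence flags in the final vector are all assumptions.

```python
from collections import Counter

def ngrams(calls, n):
    """All runs of n consecutive API calls, as tuples."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

def frequent_grams(calls, sizes=(3, 4), local_min=1):
    """Per-file step: keep grams seen at least local_min times in this log."""
    counts = Counter(g for n in sizes for g in ngrams(calls, n))
    return {g for g, c in counts.items() if c >= local_min}

def build_dynamic_vectors(logs, global_min=2):
    """Global step: keep grams found in >= global_min files, then encode."""
    per_file = {name: frequent_grams(calls) for name, calls in logs.items()}
    doc_freq = Counter(g for grams in per_file.values() for g in grams)
    # Sort by descending frequency, ties broken by the gram itself.
    features = sorted((g for g, c in doc_freq.items() if c >= global_min),
                      key=lambda g: (-doc_freq[g], g))
    vectors = {name: [int(g in grams) for g in features]
               for name, grams in per_file.items()}
    return features, vectors

# Hypothetical call logs (invented for illustration).
logs = {
    "a.exe": ["OpenProcess", "VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"],
    "b.exe": ["OpenProcess", "VirtualAlloc", "WriteProcessMemory", "CloseHandle"],
    "c.exe": ["CreateFile", "ReadFile", "CloseHandle"],
}
features, vectors = build_dynamic_vectors(logs)
```

In this toy dataset only one 3-gram is shared by two files, so it becomes the single dynamic feature; real logs would produce a much longer list.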

The Embedded feature
The proposed method uses embedded features: a feature vector that contains both static and dynamic features. The embedded feature vector is used to classify the binary files. As shown in Table 4, it is a concatenation of the static PSI features and the dynamic API call sequence features.
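The concatenation itself is simple, as the following Python sketch suggests. The vectors and file names are hypothetical, and placing the static features before the dynamic ones is an assumption about column order.

```python
def embed(static_vec, dynamic_vec):
    """Concatenate static features, then dynamic features, into one vector."""
    return list(static_vec) + list(dynamic_vec)

# Hypothetical per-file feature vectors (invented for illustration).
static_vectors = {"a.exe": [1, 0, 1], "b.exe": [0, 0, 1]}
dynamic_vectors = {"a.exe": [1, 1], "b.exe": [0, 1]}

embedded = {name: embed(static_vectors[name], dynamic_vectors[name])
            for name in static_vectors}

# Every file must end up with the same number of columns,
# otherwise the vectors cannot be fed to one classifier.
widths = {len(v) for v in embedded.values()}
```

Keeping the column order identical for every file is what allows a classifier to treat position i as the same feature across the whole dataset.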

Machine Learning
Many researchers have already used machine learning techniques to classify malware [13,14]. In our work, however, static and dynamic features are combined, and the resulting feature vector is fed as input to the machine learning algorithm for training and classification. Association vectors, decision trees, support vector machines, and random forests are the most popular machine learning algorithms used for malware classification. The literature shows that random forests and support vector machines are more efficient; hence we use random forest and support vector machines.

EXPERIMENT AND RESULT
The static analysis is conducted on 997 virus files and 490 clean files, each analyzed using the strings utility. The experimental environment is set up on an Ubuntu 14.04 machine, on which the strings utility is run for each binary file. The analysis output of each file is written to a file with the same name as the binary file. From each output file (containing the PSI) we extracted all strings longer than 8 bytes and fed them to the algorithm as input to create the feature set. In total, 835 static features were extracted in our analysis.
Dynamic feature extraction is done by executing the same binary files used in the static analysis in the Cuckoo malware analysis system, which provides a log of the sequence of API calls. The environment is set up on the Ubuntu 10.04 LTS operating system. The analyzer system is configured to work with VMware Workstation 10.0, inside which we installed three Windows XP operating systems; these machines are called analysis host machines, and the binary files are executed on them. N-grams are created from the API call sequence of each binary file in the dataset, and the feature vector is created as explained in the previous section. In our experiment, 573 4-gram and 262 3-gram features were selected to create the feature vector. The machine learning tool WEKA [15] is used for classification. Table 5 shows the classification results of the static, dynamic and embedded methods using the SVM and random forest algorithms.
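The string-extraction step of this setup (keeping strings longer than 8 bytes) can be approximated in pure Python as follows. This is only a rough stand-in for the strings utility, not the tool used in the experiment: it matches runs of printable ASCII bytes, and the function name and sample bytes are hypothetical.

```python
import re

def printable_strings(data: bytes, min_len: int = 9):
    """Return runs of printable ASCII at least min_len bytes long.

    min_len=9 corresponds to "length greater than 8 bytes".
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]

# Hypothetical fragment of a binary file (invented for illustration).
sample = b"\x00\x01GetLastError\x00abc\x00FindFirstFileA\x00\xff"
strings_found = printable_strings(sample)
```

The short run "abc" is discarded by the length filter, while the two API names are kept as candidate PSI entries.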

CONCLUSION
In this work, we have presented an embedded approach that uses both static and dynamic features for malware detection.
Our results support the thesis that combining static and dynamic features increases detection accuracy over stand-alone static and dynamic methods.
The results show that the support vector machine technique is best equipped to classify our data. However, random forest also gives better accuracy, along with improvements in the FP rate. From the classification results, it is clear that dynamic analysis is more accurate than code-based static methods. In line with the objective of the study, the embedded approach increases the detection accuracy: with a classification accuracy of 98.72%, the embedded method is about 1.55 percentage points better than dynamic analysis (97.17%). The results also show that the method has higher accuracy than the methods in the literature survey.
To continue our work, more features can be extracted through static and dynamic analysis, and the number of features can be reduced to improve the efficiency of the classification. Feature selection algorithms can be used to reduce the feature count.