SENTIMENT ANALYSIS USING A NOVEL APPROACH TO CLASSIFY SENTIMENTS IN SOCIAL NETWORKING DATA

Abstract: Sentiment analysis is the task of finding the polarity of a given document. The document may be a sentence, a paragraph or a text spanning many pages. Its polarity may be positive, negative or neutral, and it reflects the mood and emotions of the user. Twitter is one of the most popular social media platforms today and a major platform for communication. In this research, tweets from Twitter are taken for sentiment analysis. The biggest challenge lies in identifying the polarity of a document accurately. A number of machine learning algorithms are available that use supervised or semi-supervised techniques. These algorithms apply a unigram, bigram, n-gram or hybrid approach. Semi-supervised learning is used in this research. In this work, the unigram and bigram approaches are combined to form a novel model that uses the Naïve Bayes approach, and the results are reported. The novel approach gave a better result. A time-based analysis was also performed in order to find the day-wise polarity of the tweets.


I. INTRODUCTION
Sentiment analysis, also known as opinion mining, analyzes the opinions of people towards topics such as products, organizations and their related attributes. Today, social media plays an enormous role in providing information about any topic through reviews, blogs and comments. This paper presents a supervised learning method on labeled data. Here the label refers to the tweet and the time it was posted. Twitter is a rich source of unstructured data. In this novel method using the Naïve Bayes approach, the data are first collected from Twitter using NodeXL. Some manual preprocessing is then required: all unwanted columns are removed and only the tweets are kept. The tweets are then converted into vector format, in which the text data are represented as a matrix of numbers. Building on the unigram approach, the bigram approach evaluates the unigram approach first, then performs a bigram operation and gives a combined result in the form of a graph. The method used in this bigram approach is still Naïve Bayes, which yields better accuracy than the normal Bayes algorithm.
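The conversion of tweets into a matrix of numbers with combined unigram and bigram features can be sketched as follows. This is a minimal illustration in Python, not the authors' C# implementation; the function names, the lower-casing and the whitespace tokenisation are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigram_bigram_features(tweet):
    """Count the unigrams and bigrams of a lower-cased, whitespace-split tweet."""
    tokens = tweet.lower().split()  # assumption: simple whitespace tokenisation
    return Counter(ngrams(tokens, 1) + ngrams(tokens, 2))


def vectorize(tweets):
    """Build a count matrix: one row per tweet, one column per feature."""
    counts = [unigram_bigram_features(t) for t in tweets]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[feature] for feature in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical tweets, used only to show the shape of the output.
vocabulary, matrix = vectorize(["good phone", "not good phone"])
print(vocabulary)  # ['good', 'good phone', 'not', 'not good', 'phone']
print(matrix)      # [[1, 1, 0, 0, 1], [1, 1, 1, 1, 1]]
```

The bigram "not good" appears only in the second row, which is the kind of context a unigram-only representation cannot capture.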

II. LITERATURE SURVEY
Sentiment analysis is a branch of the field of opinion mining. A number of researchers have worked in these fields to obtain the best outcome. A survey was carried out to analyze the techniques and tools available for sentiment analysis. J. M. Wiebe [1] presented different algorithms for identifying sentiment. M. A. Hearst [2] proposed adding intelligence to sentiment analysis. Researchers have used a variety of machine learning methods. V. Suresh [3] presented an approach that used stop words and the gaps between stop words as features for sentiment analysis. Murthy G. and Bing Liu [4] made a comparative study of sentence-based and web-context-based sentiments. Pang 2002 [5] and Matsumoto 2005 [6] proposed unigram approaches in their research work. Dave, Lawrence and Pennock [7] used a tool to synthesize reviews. Matsumoto, Takamura and Okumura [9] worked at the document level and identified syntactic relationships among the words. Liu and Chen [10] proposed multilevel classification for sentiment analysis. Harvinder Jeet Kaur and Rajiv Kumar [11] studied different methods for automatic polarity classification of textual data. Xiaoming Gao, Emilio Ferrara and Judy Qiu [12] aimed to show that powerful general software subsystems will enable many other applications that need integration of streaming and batch data analytics. J. Prabhu, M. Sudharshan, M. Saravanan and G. Prasad [13] discussed the use of the Rapid Clustering Method to analyze characteristics in social networks. Xin Chen, Krishna Madhavan and Mihaela Vorvoreanu [14] used a system called Social Web Analysis Buddy (SWAB) to analyse student-posted content on social media sites in order to facilitate the understanding of human behaviours and social tendencies.

III. TWITTER DATA
The goal of this paper is to implement a novel method that uses the Naïve Bayes approach and to measure its accuracy. Many researchers have worked on domain-specific sentiment analysis. This research aims to propose a new model that can classify any dataset. Twitter is a highly popular social media platform for communication, where users can share and express their thoughts on any topic in the form of tweets. In this implementation, tweets are taken for sentiment analysis. Twitter offers access to tweets to users who create an account. The dev.twitter.com application panel can generate an OAuth access token for the owner of the application. This is suitable if the application only needs to make requests to the API on behalf of its owner, without the user establishing the connection manually every time, or if the functionality of the API is to be tested from a single user.

IV. NAÏVE BAYES APPROACH
The Naïve Bayes model is easy to build and is well suited to large data sets. It is known for its simplicity. Although it is simple to build, it can outperform various other data classification algorithms. It provides a way of calculating the posterior probability and classifying with it [8]. It is a statistical model that uses a probabilistic equation for classification, given in Equation 1:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \qquad (1)$$

In Equation 1,
• P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood, which is the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
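The use of Equation 1 for classification can be illustrated with a small multinomial Naïve Bayes sketch in Python. It is not the authors' implementation. The class name, the toy training data and the add-one (Laplace) smoothing are assumptions introduced for the example. Because P(x) is the same for every class, it is dropped, and the classes are compared by P(c) multiplied by the product of the per-token likelihoods.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over tokens with add-one smoothing (an assumption)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)          # used for the prior P(c)
        self.feature_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.feature_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_posteriors(self, tokens):
        """Return log P(c) + sum of log P(x_i | c) per class; P(x) is omitted."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)
            total = sum(self.feature_counts[label].values())
            for token in tokens:
                count = self.feature_counts[label][token]
                score += math.log((count + 1) / (total + len(self.vocabulary)))
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_posteriors(tokens)
        return max(scores, key=scores.get)


# Hypothetical labeled tweets, already tokenised.
docs = [["good", "phone"], ["great", "service"],
        ["bad", "battery"], ["terrible", "service"]]
labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["good", "service"]))      # positive
print(model.predict(["terrible", "battery"]))  # negative
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps an unseen token from forcing a class probability to zero.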

V. SYSTEM DESIGN
To implement this new method, a desktop application was developed using Microsoft Visual Studio and the C# programming language. A simple Windows form takes the input file and produces the result as a graph. Tweets contain noisy data such as smileys, abbreviations, URLs, emoticons, special characters, words from other languages and more. This implementation accepts an input file from which these noisy data have been removed manually. Only the English sentences are stored in the Excel file. To be read correctly by the program, the file must follow this structure:
• The input file should be a .xls (Microsoft Excel) or .csv file.
• The dataset should contain the column of tweets first, followed by the column of tweeted dates.
• The first column (tweets) can contain any type of data as long as it is the tweet itself.
• The second column should contain only values of type date.
Visual Studio provides many features, such as IntelliSense, designers and debugging. C# is an object-oriented programming language designed to be fully compatible with the Microsoft .NET framework. The .NET framework is a software framework that provides a large library and language interoperability between several programming languages. A large number of common functions are found in the base class library.
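The input structure described above (tweet text in the first column, tweeted date in the second) can be read from a .csv file roughly as follows. This Python sketch is for illustration only, since the application itself is written in C#. The function name, the absence of a header row and the YYYY-MM-DD date format are assumptions, and the sample rows are invented.

```python
import csv
import io
from datetime import date, datetime


def read_tweets(handle, date_format="%Y-%m-%d"):
    """Read (tweet, date) pairs from a two-column CSV: tweet first, date second."""
    rows = []
    for record in csv.reader(handle):
        if len(record) < 2 or not record[0].strip():
            continue  # skip blank or incomplete rows
        tweet = record[0].strip()
        # Assumption: dates are written as YYYY-MM-DD; strptime raises
        # ValueError if the second column is not a valid date.
        tweeted_on = datetime.strptime(record[1].strip(), date_format).date()
        rows.append((tweet, tweeted_on))
    return rows


# Hypothetical file content; a real run would pass open("tweets.csv", newline="").
sample = io.StringIO(
    '"Loving the new phone, great camera",2016-03-01\n'
    "Battery died again,2016-03-02\n"
)
rows = read_tweets(sample)
print(rows[0])  # ('Loving the new phone, great camera', datetime.date(2016, 3, 1))
```

Parsing the second column into a date at load time enforces the requirement that it hold only dates, and it makes a later day-wise grouping of the tweets straightforward.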

VI. CROSS VALIDATIONS
Cross-validation is a technique used in the evaluation of predictive models. In a prediction problem, a model is usually given a dataset of known data on which training is run and a dataset of unknown data against which the model is tested. It is also called rotation estimation [15][17][18]. It is based on the principle that testing the algorithm on a new dataset yields a better estimate of its performance. The samples used for training are split into validation samples and training samples. Cross-validation combines (averages) measures of fit (prediction error) to derive a more accurate estimate of model prediction performance [16]. The training samples are used to train the algorithm, and the validation samples are used as new data to evaluate the performance of the algorithm. The implementation can run three cross-validation methods:
• Holdout method
• K-fold cross validation
• Leave-one-out cross validation
The Naïve Bayes system is designed using Windows Forms and reads an input .xls file containing the dataset (tweets). The input dataset is split into a training set and a test set. The training set is then used to calculate the probabilities of each class. The conditional probabilities of each class are calculated using a single instance from the test set. Posterior probabilities of each class are then

A. Holdout Method
The holdout method is one of the simplest types of validation: the dataset is split into two sets, the training set and the validation set. The algorithm is trained using the training set only and is then evaluated on the validation set. The evaluation can have a high variance, as it may depend heavily on which data points end up in the training set and which in the validation set.
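A holdout split can be sketched in a few lines of Python. The function name, the 30% validation fraction and the fixed random seed are assumptions chosen for the illustration and are not taken from the implementation described in this paper.

```python
import random


def holdout_split(samples, validation_fraction=0.3, seed=0):
    """Shuffle a copy of the samples and return (training, validation) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


training, validation = holdout_split(range(10))
print(len(training), len(validation))  # 7 3
```

Changing the seed changes which samples fall into each set, which is the source of the variance in the holdout estimate.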

B. K-Fold Cross Validation
K-fold cross validation is an improvement on the holdout method. The dataset is divided into k subsets and the holdout method is repeated k times. In each repetition, one of the k subsets is used as the validation set and the remaining k-1 subsets are put together to form the training set. The errors across all the trials are then averaged. The disadvantage of the k-fold cross validation method is that it takes k times more computation than the holdout method, because the algorithm has to run k times.
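The k-fold procedure can be sketched as follows. The function names and the interleaved assignment of samples to folds are assumptions made for this Python illustration; `evaluate` is a hypothetical callback that trains on the first argument and returns the error measured on the second. The samples are assumed to be shuffled beforehand.

```python
def k_fold_splits(samples, k):
    """Yield (training, validation) pairs; each fold is the validation set once."""
    samples = list(samples)
    folds = [samples[i::k] for i in range(k)]  # interleaved folds of near-equal size
    for i, validation in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, validation


def cross_validate(samples, k, evaluate):
    """Average the error returned by evaluate(training, validation) over k folds."""
    errors = [evaluate(training, validation)
              for training, validation in k_fold_splits(samples, k)]
    return sum(errors) / len(errors)


for training, validation in k_fold_splits(range(6), 3):
    print(validation, training)
# [0, 3] [1, 4, 2, 5]
# [1, 4] [0, 3, 2, 5]
# [2, 5] [0, 3, 1, 4]
```

Every sample is used for validation exactly once and for training k-1 times, which is why the averaged error is less sensitive to a single lucky or unlucky split.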

C. Leave-One-Out Cross Validation (LOOCV)
LOOCV is the logical extreme of k-fold cross validation, where k equals the number of data points. The algorithm is trained on all data points except one, which is used for testing, and this is repeated for each data point. LOOCV can be computationally costly, since it requires building as many models as there are data points in the training set. Applying the Bayes theorem from

IX. CONCLUSION
This research work aims at implementing a novel method for the sentiment classification of social media data. It implements a novel approach that uses Naïve Bayes for sentiment classification and yields more accurate results. The accuracy rate is also calculated. The approach accepts social media data in an Excel file and classifies each item as positive, negative or neutral. The result is represented graphically, and a time series line is also calculated for future analysis.