ANNOTATING FEATURES EXTRACTED THROUGH LATENT DIRICHLET ALLOCATION FOR FEATURE BASED OPINION MINING

Abstract: Online product reviews contain opinions about products and their features. These reviews are plain text, so analyzing them requires considerable effort. In this paper, we tackle the problem of feature-based opinion mining of product reviews using the LDA topic model and a proposed annotation algorithm. We propose an architecture for feature-based opinion mining based on topic models and an algorithm that automatically annotates features extracted through the LDA topic model. Experimental results show that the algorithm gives an average feature annotation accuracy of 77.14%, an average positive polarity annotation accuracy of 86.02%, and an average negative polarity annotation accuracy of 88.57%. The algorithm can be used with different topic models as well.


INTRODUCTION
There are many online shopping websites that ask their customers to review products. The number of customer reviews a product receives grows rapidly day by day. These reviews are very useful both for product manufacturers and for people planning to purchase the product. Because the number of reviews is very large, it is difficult to analyze people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards a product and its features. Different techniques are therefore used in the area of Feature-Based Opinion Mining, or Aspect-Based Sentiment Analysis, to analyze and summarize product reviews. These techniques are expected to extract the product features a reviewer has commented on along with the opinion or sentiment expressed, determine whether each expressed opinion is positive or negative, and then summarize how many positive and negative opinions are expressed about a particular product and its features.
Different approaches have been proposed to solve the problem of feature-based opinion mining. Liu [1] classified these approaches into four categories: 1. Finding frequent nouns and noun phrases, 2. Using opinion and target relations, 3. Using supervised learning, 4. Using topic models and mapping implicit aspects. Schouten and Frasincar [2] discussed a taxonomy for aspect-level sentiment analysis approaches, classifying the tasks of feature-based opinion mining into four approaches: 1. Syntax-based, 2. Supervised machine learning, 3. Unsupervised machine learning, and 4. Hybrid machine learning.
One of the most popular unsupervised machine learning approaches for feature-based opinion mining is topic modeling. In machine learning and natural language processing, topic modeling is a type of statistical modeling for discovering the abstract topics that occur in a collection of documents [3]. Topic modeling is an unsupervised learning method; it assumes that each document consists of a mixture of topics and that each topic is a probability distribution over words [4].
There are mainly two basic topic models: Probabilistic Latent Semantic Indexing (PLSI) and Latent Dirichlet Allocation (LDA). Many topic models have been proposed specifically to solve the problem of feature-based opinion mining; they are based on either PLSI or LDA. These models are very good at extracting features from reviews, but the extracted features must be annotated manually, and the models do not summarize the reviews. In this paper, we therefore propose a system architecture and an annotation algorithm that automatically annotates the features extracted through the LDA topic model, finds the polarity orientation of each extracted feature, and summarizes the product reviews feature-wise.
The contributions of this paper are: 1. A system architecture for feature-based opinion mining and summarization; 2. An annotation algorithm that automatically annotates features extracted through the LDA topic model and summarizes the product reviews.
This paper is organized as follows. Section 2 discusses related work, Section 3 describes the proposed system, Section 4 focuses on experiments and results, and Section 5 concludes the paper with future directions.

RELATED WORK
Titov and McDonald [5] proposed multi-grain topic models for extracting ratable features from reviews. This model is extended in [6] into a joint model of text and aspect ratings for extracting text to be displayed in sentiment summaries. Our proposed algorithm also generates a summary of feature-wise ratings for a particular product.
Lu et al. [7] proposed a topic model based on PLSI to generate a rated aspect/feature summary. The model decomposes the overall ratings into ratings for the major features and gives a feature-wise rating. It is useful for short comments, and it needs the overall rating of each short comment as input. Our proposed system works on complete product reviews that are not rated.
Brody and Elhadad [8] proposed an LDA-based model for feature-based opinion mining and summarization. The model extracts topics as features at the sentence level. Polarity identification is done for each feature using seed adjectives with known polarity. The model is flexible with regard to the domain and language of the reviews, and it is closely related to our proposed system, which also uses positive and negative seed words. In [8], a conjunction graph is built over adjectives for each feature for polarity identification, whereas we propose a simple algorithm for this purpose.
Wang et al. [9] proposed a model that does not need pre-specified product features. It uses review ratings and mines latent topical features, a rating on each identified feature, and the weights a reviewer places on different features. The model can be applied to data from various domains. In each of the works above, a topic model is proposed for feature extraction and a summary is then generated. We use the Latent Dirichlet Allocation topic model to extract features, and the proposed algorithm is then used for annotation and summary generation.

PROPOSED SYSTEM
Figure 1 shows the proposed system architecture. The system performs summarization in four steps: (1) preprocess the dataset, (2) apply the topic model, (3) annotate the topics, (4) generate the summary. These steps are performed in multiple substeps:

A. Preprocessing
Preprocessing is done in three steps. First, data cleaning removes unwanted parts of the reviews in the dataset; for example, we do not need the ReviewerID and ReviewerName fields, so we removed them from the reviews. After data cleaning, stopwords are removed. Stopwords are common English words with a high frequency of occurrence that are not useful for further processing. Finally, stemming is applied, reducing derived words to their stems.
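The three preprocessing steps can be sketched as follows (a minimal Python sketch: the stopword set is a tiny illustrative excerpt, and `porter_like_stem` is a crude suffix stripper standing in for a real Porter stemmer, not the stemmer used in our experiments):

```python
import re

# Tiny illustrative stopword excerpt; the real system uses a 571-word list.
STOPWORDS = {"the", "is", "a", "an", "and", "it", "this", "to", "of", "for"}

def porter_like_stem(word):
    """Crude suffix stripping standing in for the Porter stemmer."""
    for suffix in ("ation", "ing", "ness", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(review):
    """Clean one review, remove stopwords, and stem the remaining tokens."""
    text = re.sub(r"[^a-z\s]", " ", review.lower())           # data cleaning
    tokens = [t for t in text.split() if t not in STOPWORDS]  # stopword removal
    return [porter_like_stem(t) for t in tokens]              # stemming

print(preprocess("The cooling unit is amazing and it installs fast!"))
# → ['cool', 'unit', 'amaz', 'install', 'fast']
```

Note how the stems ('cool', 'amaz', 'install') match the truncated word forms visible in the sample topic output.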

B. Apply Topic Model
The LDA topic model [3] is applied on the preprocessed review dataset to extract product features. Figure 2 shows the graphical representation of the Latent Dirichlet Allocation topic model.

C. Annotate the Topics
We use Turney's [11] paradigm words as positive and negative seed words. The lists of positive and negative seed words are shown in Figure 4 and Figure 5, respectively. We use WordNet [12] to grow this seed list automatically: synonyms and antonyms of the words in the seed list are looked up using WordNet-3.0.
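A minimal sketch of how seed-based topic annotation can work (assumed logic for illustration only; the actual procedure is the annotation algorithm of Figure 3, and the seed sets below are short excerpts of the Turney paradigm lists):

```python
# Short excerpts of the positive/negative seed lists (Figures 4 and 5).
POSITIVE_SEEDS = {"good", "nice", "excellent", "great", "love", "perfect"}
NEGATIVE_SEEDS = {"bad", "poor", "terrible", "problem", "worst", "waste"}

def annotate_topic(top_words, feature_list):
    """Annotate one topic: pick the first top word that appears in a known
    product-feature list ('null' if none), then classify the topic's polarity
    by counting hits against the positive and negative seed sets."""
    feature = next((w for w in top_words if w in feature_list), "null")
    pos = sum(1 for w in top_words if w in POSITIVE_SEEDS)
    neg = sum(1 for w in top_words if w in NEGATIVE_SEEDS)
    polarity = "positive" if pos > neg else "negative" if neg > pos else "neutral"
    return feature, polarity

print(annotate_topic(["battery", "great", "love", "short"], {"battery", "screen"}))
# → ('battery', 'positive')
```

A topic whose top words contain no known feature still receives a polarity label, matching the 'null' rows in the sample summary.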

D. Generate Summary
Finally, we generate a summary based on the annotations produced by the proposed annotation algorithm.
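The summarization step amounts to aggregating the per-topic annotations feature-wise; a minimal sketch (the feature names and counts below are purely illustrative):

```python
from collections import Counter, defaultdict

def summarize(annotated_topics):
    """Aggregate (feature, polarity) annotations into a feature-wise summary."""
    summary = defaultdict(Counter)
    for feature, polarity in annotated_topics:
        summary[feature][polarity] += 1
    return {feature: dict(counts) for feature, counts in summary.items()}

annotations = [("battery", "positive"), ("battery", "negative"),
               ("screen", "positive"), ("null", "positive")]
print(summarize(annotations))
# → {'battery': {'positive': 1, 'negative': 1}, 'screen': {'positive': 1}, 'null': {'positive': 1}}
```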

EXPERIMENTS AND RESULTS
We used the Amazon dataset available at http://uilab.kaist.ac.kr/research/WSDM11. This dataset contains 24,259 reviews in 7 product categories; Table 1 shows the details of the dataset. We performed all the steps on each of the 7 product categories separately. For stopword removal we use a list of 571 stopwords. Stemming is done using the Porter stemming algorithm. We set K, the number of topics, to 20.
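For illustration, LDA inference by collapsed Gibbs sampling can be sketched in pure Python (a toy sketch on a tiny hand-made corpus, not the implementation used in the experiments; the hyperparameter values `alpha` and `beta` are illustrative assumptions):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the top 5 words per topic."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})       # vocabulary size
    z = []                                          # topic assignment per token
    ndk = [[0] * K for _ in docs]                   # document-topic counts
    nkw = [defaultdict(int) for _ in range(K)]      # topic-word counts
    nk = [0] * K                                    # tokens per topic
    for di, doc in enumerate(docs):                 # random initialization
        zs = []
        for w in doc:
            t = rng.randrange(K)
            zs.append(t)
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                t = z[di][wi]                       # remove token from counts
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional p(topic = k | all other assignments)
                weights = [(ndk[di][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + V * beta) for k in range(K)]
                r = rng.random() * sum(weights)
                t = K - 1
                for k, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        t = k
                        break
                z[di][wi] = t                       # re-add with sampled topic
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [sorted(nkw[k], key=nkw[k].get, reverse=True)[:5] for k in range(K)]

# Toy corpus of stemmed tokens: two themes (cooling units vs. shipping).
docs = [["air", "cooler", "window", "unit", "air", "hose"],
        ["ship", "amazon", "return", "item", "ship"],
        ["cooler", "unit", "window", "air", "drain"],
        ["return", "item", "amazon", "ship", "box"]]
print(lda_gibbs(docs, K=2, iters=300, seed=1))  # top words for each of the 2 topics
```

The experiments use the same inference scheme at scale: K = 20 topics and 600 Gibbs iterations per product category.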
For inference we use a Gibbs sampler with 600 iterations. Table 2 shows sample output of the LDA topic model, and Table 3 shows sample output of the annotation algorithm, in which the column headings are the annotations. These tables show 10 topics and the first 10 rows only. Table 4 and Figure 6 show the results of the algorithm; we calculate feature annotation accuracy and polarity annotation accuracy separately. A generated sample summary is shown in Table 5. In Table 5, a 'null' value in the feature column indicates that the topic was not annotated by the algorithm and is therefore labeled 'null'. The algorithm still gives a polarity classification for 'null' features.

Figure 3: Annotation Algorithm

Figure 4: Positive seed words: good, nice, excellent, positive, fortunate, correct, superior, amazing, attractive, awesome, best, comfortable, enjoy, fantastic, favorite, fun, glad, great, happy, impressive, love, perfect, recommend, satisfied, thank, worth

CONCLUSION
We proposed an architecture for feature-based opinion mining and an algorithm that automatically annotates features extracted through the LDA topic model. The LDA topic model is applied on product reviews to extract product features, and the proposed annotation algorithm is applied on the extracted features. The algorithm gives an average feature annotation accuracy of 77.14%, an average positive polarity annotation accuracy of 86.02%, and an average negative polarity annotation accuracy of 88.57%. The accuracy of feature annotation depends entirely on the list of product features; for better performance, the list should contain the frequent product features. The accuracy of polarity annotation increases as the seed list grows automatically.
For future work, the proposed algorithm can be applied to topic models specifically proposed for feature extraction from product reviews.

Figure 5: Negative seed words: bad, nasty, poor, negative, unfortunate, wrong, inferior, annoying, complain, disappointed, hate, junk, mess, not good, not like, not recommend, not worth, problem, regret, sorry, terrible, trouble, unacceptable, upset, waste, worst, worthless

Table 2: Sample output of the LDA topic model (stemmed topic words, e.g. air, cooler, window, unit, hose, drain, exhaust, vent, fan, heat)