Abstract Syntax Trees with Latent Semantic Indexing for Source Code Plagiarism Detection

Resmi N.G.
K.P. Soman


In this paper, we study the results of combining two source code plagiarism detection approaches, with some modifications to existing systems, for detecting source code plagiarism in academic settings. Structure-based techniques are more effective at detecting similarity than software-metric-based techniques, but they are generally computationally expensive. Here, we combine an attribute-metric-based detection approach, Latent Semantic Indexing (LSI), with a structure-based approach, Abstract Syntax Tree (AST) comparison. LSI is first used to identify a set of potentially plagiarized programs, which are then tested further for similarity by comparing their abstract syntax trees. Using LSI for screening reduces the computational cost of tree generation and comparison. Moreover, we have modified the preprocessing stage of LSI and added a post-processing stage for improved performance. Our method was tested on C, C++ and Java source code files. Both approaches were first tested individually on a collection of student programs of varying functionality and size. They were then combined and found to give better results than either approach run independently. Performance is evaluated by calculating precision and recall.
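The screen-then-verify idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: a plain bag-of-tokens cosine similarity stands in for LSI (a real LSI stage would also project the term-document matrix into a reduced space via singular value decomposition), Python's built-in `ast` module stands in for a C, C++ or Java parser, and the function names, thresholds and sample programs are hypothetical.

```python
import ast
import math
import re
from collections import Counter
from itertools import combinations


def token_vector(source):
    """Bag-of-tokens vector (identifiers, keywords, integers).
    Assumption: this replaces the LSI term weighting and SVD projection."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+", source))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ast_shape(source):
    """Multiset of AST node-type names; identifier names are ignored."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def ast_similarity(src_a, src_b):
    """Multiset Jaccard similarity of the two node-type multisets."""
    a, b = ast_shape(src_a), ast_shape(src_b)
    shared = sum((a & b).values())  # element-wise minimum of counts
    total = sum((a | b).values())   # element-wise maximum of counts
    return shared / total if total else 0.0


def detect(programs, screen_threshold=0.5, ast_threshold=0.9):
    """Stage 1: cheap vector screening. Stage 2: AST comparison,
    run only on pairs that pass the screen. Thresholds are hypothetical."""
    vectors = {name: token_vector(src) for name, src in programs.items()}
    flagged = set()
    for x, y in combinations(sorted(programs), 2):
        if cosine(vectors[x], vectors[y]) < screen_threshold:
            continue  # screened out: no trees are built for this pair
        if ast_similarity(programs[x], programs[y]) >= ast_threshold:
            flagged.add((x, y))
    return flagged


def precision_recall(flagged, truth):
    """Precision and recall of flagged pairs against known plagiarized pairs."""
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical sample submissions: b.py is a.py with renamed identifiers.
PROGRAMS = {
    "a.py": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "b.py": "def total(values):\n    s = 0\n    for v in values:\n        s += v\n    return s\n",
    "c.py": "print('hello world')\n",
}

flagged_pairs = detect(PROGRAMS)
```

In this sketch the unrelated program never reaches the tree comparison, which is the cost saving the screening stage is meant to provide. Comparing node-type counts is a deliberately crude notion of tree similarity; it ignores tree structure and would need to be replaced by a proper tree-matching step in practice.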


Keywords: abstract syntax trees; latent semantic indexing; plagiarism detection; singular value decomposition

