Wednesday, December 5, 2018

Paper Review 9 : Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection

In this post, I summarize the article "Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection".

Deepa Gupta, Dept. of Mathematics, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India
Vani K, Dept. of Computer Science, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India
Charan Kamal Singh, Dept. of Computer Science, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India


Summary:

Based on personal experience here at Bilkent University, where I was accused of cheating on a homework assignment, I learned that plagiarism is one of the most serious offenses in academia. When I tried to understand that rule and how it is applied, I found that plagiarism detection is based on similarity, which is the main problem we are facing in the second part of our natural language processing term project. This is why I chose this paper, which explains what plagiarism is and how it can be detected.
This paper investigates several methods for external plagiarism detection, in which suspicious documents are compared against a collection of source documents. Semantics and linguistic variation play a very important role in this detection.
The general approach to external plagiarism detection includes the following steps:
  • pre-processing 
  • candidate document selection 
  • document comparisons 
  • passage boundary detection and evaluation
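The first two steps above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the stopword list, the choice of word 3-grams, the containment score, and the `min_score` cutoff are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to"}


def preprocess(text):
    """Lowercase, tokenize on letters/digits, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def word_ngrams(tokens, n=3):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, source, n=3):
    """Fraction of the suspicious text's n-grams that also occur in the source."""
    susp = word_ngrams(preprocess(suspicious), n)
    src = word_ngrams(preprocess(source), n)
    return len(susp & src) / len(susp) if susp else 0.0


def select_candidates(suspicious, sources, n=3, min_score=0.1):
    """Keep the sources whose containment score reaches min_score, best first."""
    scored = [(containment(suspicious, text, n), name)
              for name, text in sources.items()]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= min_score]
```

Calling `select_candidates(suspicious_text, {"doc1": "...", "doc2": "..."})` returns the names of the source documents worth comparing in detail, so the more expensive comparison step only runs on a short list.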

Some of the related works mentioned in this paper are:
  • the N-gram-based method proposed by Efstathios Stamatatos
  • the fuzzy-token matching method based on string similarity join, by Jiannan Wang, Guoliang Li, and Jianhua Feng

Building on those works (the basic methods), the proposed system operates as described in the following schematic diagram:


The input is a set of source documents, each with its corresponding suspicious document.

After the suspicious text is compared with a source text, all matched N-grams are stored. Passages are then formed in both the suspicious and the source document based on a passage boundary condition, which acts as a threshold value. This value depends on the total word length of the text. The fuzzy-semantic method is then used to improve the similarity calculation.
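To make this step more concrete, here is a rough Python sketch of the idea, under several assumptions of mine. The fixed `max_gap` stands in for the paper's passage boundary threshold (which, per the paper, depends on the total word length of the text), and `SYNONYMS` with `word_similarity` is a hypothetical toy replacement for a real lexical resource. The scoring in `fuzzy_similarity` (best match per word, then averaged) is only one simple way to phrase a fuzzy match, not the paper's exact measure.

```python
def matched_positions(susp_tokens, src_tokens, n=3):
    """Start positions in susp_tokens of n-grams that also occur in src_tokens."""
    src_ngrams = {tuple(src_tokens[i:i + n])
                  for i in range(len(src_tokens) - n + 1)}
    return [i for i in range(len(susp_tokens) - n + 1)
            if tuple(susp_tokens[i:i + n]) in src_ngrams]


def form_passages(positions, n=3, max_gap=5):
    """Merge matched n-grams into (start, end) token spans; end is exclusive.

    A match joins the current passage when it starts at most max_gap tokens
    after that passage ends; otherwise it opens a new passage.
    """
    passages = []
    for pos in positions:
        if passages and pos - passages[-1][1] <= max_gap:
            passages[-1][1] = max(passages[-1][1], pos + n)
        else:
            passages.append([pos, pos + n])
    return [tuple(p) for p in passages]


# Hypothetical synonym table standing in for a real lexical resource.
SYNONYMS = {frozenset(("big", "large")), frozenset(("quick", "fast"))}


def word_similarity(a, b):
    """Toy word-level similarity: 1.0 if identical, 0.5 if listed as synonyms."""
    if a == b:
        return 1.0
    return 0.5 if frozenset((a, b)) in SYNONYMS else 0.0


def fuzzy_similarity(susp_tokens, src_tokens):
    """Average, over suspicious words, of the best match in the source passage."""
    if not susp_tokens or not src_tokens:
        return 0.0
    best = [max(word_similarity(w, s) for s in src_tokens) for w in susp_tokens]
    return sum(best) / len(best)
```

With this sketch, exact N-gram matches locate the candidate passages, and the fuzzy score can then credit reworded passages that exact matching alone would miss.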

The output, obtained on a partial dataset from the PAN 2012 corpus used for testing, is divided into five categories according to the type of plagiarized passage: high obfuscation, low obfuscation, simulated, no obfuscation, and no plagiarism.

Finally, using the statistics from those categories, the different NLP-based pre-processing methods used in this system are compared under the fuzzy-semantic and improved fuzzy-semantic similarity measures. The experimental results show that POSPIFS (the POS method integrated with the improved fuzzy-semantic similarity measure) performs better than the other methods in terms of accuracy and efficiency.


Presentation: Dorra EL MEKKI

Link :   presentation_dorraElMekki