This post summarizes the article "Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection".
Published in: 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI)
Link to paper: Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection
Summary:
Based on personal experience here at Bilkent University, where I was accused of cheating on a homework assignment, I learned that plagiarism is one of the most serious offenses in academia. While trying to understand that rule and how it is enforced, I found that plagiarism detection is based on similarity, which is the main problem we face in the second part of our natural language processing term project. That is why I chose this paper, which explains what plagiarism is and how it can be detected.
This paper investigates several methods of external plagiarism detection, in which suspicious documents are compared against a collection of source documents. Semantics and linguistic variation play a very important role in that detection.
The general approach to external plagiarism detection includes the following steps:
- pre-processing
- candidate document selection
- document comparisons
- passage boundary detection and evaluation
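To make the first two steps more concrete, here is a minimal sketch of pre-processing and candidate document selection. It is my own illustration, not the paper's implementation: the stop-word list, the function names, the use of Jaccard overlap on word 3-grams, and the `min_overlap` cut-off are all assumptions chosen to keep the example short.

```python
import re

# Tiny illustrative stop-word list (an assumption; real systems use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "on"}


def preprocess(text):
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n=3):
    """Return the set of word n-grams found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def select_candidates(suspicious, sources, n=3, min_overlap=0.05):
    """Return the names of sources whose n-gram overlap with the suspicious
    text reaches min_overlap (a hypothetical cut-off), best match first."""
    susp = ngrams(preprocess(suspicious), n)
    scores = {
        name: jaccard(susp, ngrams(preprocess(text), n))
        for name, text in sources.items()
    }
    kept = [name for name, score in scores.items() if score >= min_overlap]
    return sorted(kept, key=lambda name: -scores[name])
```

Calling `select_candidates(suspicious_text, {"doc1": "...", "doc2": "..."})` returns only the source names worth passing on to the more expensive detailed comparison step.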
Some of the related works mentioned in the paper are:
- N-gram based method proposed by "Efstathios Stamatatos"
- Fuzzy-Token matching method based on string similarity join by "Jiannan Wang, Guoliang Li & Jianhua Feng"
Building on those basic methods, the proposed system works as follows:
The input is a source document and its corresponding suspicious document.
After the suspicious text is compared with the source text, all matched N-grams are stored. Passages are then formed in both the suspicious and the source document based on a passage boundary condition, which acts as a threshold value that depends on the total word length of the text. Finally, the fuzzy-semantic method is used to improve the similarity calculation.
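The matching and passage-forming steps can be sketched roughly as below. This is only an illustration under my own assumptions: the summary does not give the paper's actual boundary formula, so `boundary_threshold` (one allowed gap word per 20 words of text, with a minimum of 2) is a hypothetical rule, and all function names are mine.

```python
def word_ngrams(tokens, n):
    """List of word n-grams, in order of their start position."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def matched_positions(susp_tokens, src_tokens, n=3):
    """Start positions in the suspicious text of n-grams that also occur
    somewhere in the source text."""
    src_set = set(word_ngrams(src_tokens, n))
    return [i for i, gram in enumerate(word_ngrams(susp_tokens, n)) if gram in src_set]


def boundary_threshold(total_words, divisor=20, floor=2):
    """Hypothetical boundary rule: the allowed gap grows with text length."""
    return max(floor, total_words // divisor)


def form_passages(positions, n, max_gap):
    """Merge matched n-grams into (start, end) token spans, end exclusive.
    A new passage starts when the next match begins more than max_gap
    tokens after the end of the current passage."""
    passages = []
    for pos in positions:
        if passages and pos - passages[-1][1] <= max_gap:
            passages[-1][1] = pos + n   # extend the current passage
        else:
            passages.append([pos, pos + n])
    return [tuple(p) for p in passages]
```

The same `form_passages` call would be run once over the match positions in the suspicious document and once over the positions in the source document, so that each suspicious passage can be paired with a source passage before the similarity score is computed.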
The system is tested on a partial dataset from the PAN 2012 corpus, and the output is divided into five categories according to the type of plagiarized passage: highly obfuscated, low obfuscation, simulated, no obfuscation, and non-plagiarized.
Finally, using the statistics from those categories, the paper compares the different NLP-based pre-processing methods used in the system under both the fuzzy-semantic and the improved fuzzy-semantic similarity measures. The experimental results show that POSPIFS (the POS method integrated with the improved fuzzy-semantic similarity measure) performs better than the other methods in terms of both accuracy and efficiency.
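To give a feel for what a fuzzy-semantic similarity measure computes, here is a simplified sketch. It is not the paper's formula, which this summary does not reproduce; it uses one common fuzzy-set formulation, in which a word's membership in a sentence is 1 minus the product of (1 - similarity) over the words of that sentence. The `RELATED` table is a toy stand-in for a real lexical resource, and every name in the block is my own assumption.

```python
from math import prod

# Toy word-relatedness table standing in for a real lexical resource (assumption).
RELATED = {
    frozenset({"car", "automobile"}): 0.9,
    frozenset({"buy", "purchase"}): 0.9,
}


def word_sim(w1, w2):
    """Similarity in [0, 1]: 1.0 for identical words, table lookup otherwise."""
    if w1 == w2:
        return 1.0
    return RELATED.get(frozenset({w1, w2}), 0.0)


def membership(word, sentence):
    """Fuzzy degree to which `word` belongs to `sentence` (a list of words)."""
    return 1.0 - prod(1.0 - word_sim(word, other) for other in sentence)


def fuzzy_similarity(sent_a, sent_b):
    """Average membership of the words of sent_a in sent_b.
    Note that this measure is asymmetric."""
    if not sent_a:
        return 0.0
    return sum(membership(w, sent_b) for w in sent_a) / len(sent_a)
```

With this table, `fuzzy_similarity(["buy", "car"], ["purchase", "automobile"])` scores close to 0.9 even though the two sentences share no exact word, which is the kind of reworded plagiarism that plain N-gram matching misses.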