Saturday, December 8, 2018

Paper Review 10 : Sentence Similarity Based on Semantic Vector Model

In this post, the article "Sentence Similarity Based on Semantic Vector Model" is summarized.

Written by:
  • Zhao Jingling, School of Computer, Beijing University of Posts and Telecommunications, Beijing, China
  • Zhang Huiyun, National Engineering Laboratory for Mobile Network Security, School of Computer, Beijing University of Posts and Telecommunications, Beijing, China
  • Cui Baojiang, National Engineering Laboratory for Mobile Network Security, School of Computer, Beijing University of Posts and Telecommunications, Beijing, China
Published in: 2014 Ninth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing




Summary:

According to the authors, a sentence is “considered to be a sequence of words each of which carries useful information … Words sequence contains semantic features, word order information and structural characteristics of sentence, which sentence similarity relies on.”

Previous methods for computing sentence similarity were designed for long text documents. This paper, however, presents a new approach applicable to very short texts of sentence length, based on semantic information, structural information and word order information.

In fact, the proposed approach is a combination of:

  •    Semantic similarity between sentences, based on:

      •   Word semantic similarity, which can be obtained using dictionary/thesaurus-based methods or corpus-based methods. How-net, which defines a word in a complicated multidimensional knowledge description language, is the lexical knowledge base employed in this research. In How-net, a word is described by a group of small units (sememes) that capture the word's meaning.
      •   The structure of the sentences.

    Each sentence is represented by a semantic vector computed over the joint word set of the two sentences' word sets, in which element values lie in the range [0,1].

  •    Word order similarity between sentences, which provides information about the relationships between words, since word order plays a role in conveying the meaning of a sentence.


So, the overall sentence similarity is defined as a weighted combination of the two components:

Sim(S1, S2) = ε · Ss + (1 − ε) · So

where Ss is the semantic similarity, So is the word order similarity, and ε (epsilon) is a factor weighting the significance of semantic information against word order information; its value, 0.85, was found empirically.
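The combined measure can be sketched in Python. This is a minimal sketch, not the authors' implementation: exact word matching stands in for the paper's How-net-based word similarity, and all function names are illustrative.

```python
import math

def semantic_vector(words, joint_words):
    # Exact-match stub: 1.0 if the joint word occurs in the sentence, else 0.0.
    # The paper instead derives a word similarity in [0,1] from How-net sememes.
    return [1.0 if w in words else 0.0 for w in joint_words]

def word_order_vector(words, joint_words):
    # r[i] = 1-based position of joint word i in the sentence, 0 if absent
    return [words.index(w) + 1 if w in words else 0 for w in joint_words]

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def word_order_similarity(r1, r2):
    # 1 - ||r1 - r2|| / ||r1 + r2||: identical orderings give 1.0
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0

def sentence_similarity(s1, s2, epsilon=0.85):
    w1, w2 = s1.lower().split(), s2.lower().split()
    joint = sorted(set(w1) | set(w2))
    ss = cosine(semantic_vector(w1, joint), semantic_vector(w2, joint))
    so = word_order_similarity(word_order_vector(w1, joint),
                               word_order_vector(w2, joint))
    return epsilon * ss + (1 - epsilon) * so

# Same words, different order: high semantic score, reduced order score
print(sentence_similarity("the boy kicked the dog",
                          "the dog kicked the boy"))
```

With ε = 0.85 the semantic component dominates, so the two sentences above still score well above 0.85 despite their reversed word order.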

Finally, this approach shows the best results compared to other methods, such as the semantic-and-word-order method proposed by Li Yuhua, a method based on semantics alone, and a method based on word order alone. That is why we aim to use it in our project.

Wednesday, December 5, 2018

Paper Review 9 : Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection

In this post, the Article "Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection" is summarized.

Deepa Gupta, Dept. of Mathematics, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India
Vani K, Dept. of Computer Science, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India
Charan Kamal Singh, Dept. of Computer Science, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India


Summary:

Based on a personal experience here at Bilkent University, where I was accused of cheating on a homework assignment, I discovered that plagiarism is one of the most serious offenses in academia. While trying to understand this rule and how it works, I realized that plagiarism detection is based on similarity, which is the main problem we are facing in the second part of our Natural Language Processing term project. This is why I chose this paper, which explains what plagiarism is and how it can be detected.
This paper investigates a number of different methods of external plagiarism detection, where suspicious documents are compared against a collection of source documents. In fact, semantics and linguistic variations play a very important role in that detection.
The general approach to external plagiarism detection includes the following steps:
  • pre-processing 
  • candidate document selection 
  • document comparisons 
  • passage boundary detection and evaluation
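The first two steps above can be sketched as follows. The tokenisation and the n-gram overlap scoring are generic illustrations of pre-processing and candidate document selection, not the paper's exact pre-processing variants (which include NLP techniques such as POS tagging); all names are illustrative.

```python
import re

def preprocess(text):
    # Generic pre-processing: lowercase and keep alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n=3):
    # Set of word n-grams over the token sequence
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def candidate_score(suspicious, source, n=3):
    # Candidate selection: fraction of suspicious n-grams found in the source
    s_grams = ngrams(preprocess(suspicious), n)
    if not s_grams:
        return 0.0
    return len(s_grams & ngrams(preprocess(source), n)) / len(s_grams)

src = "The quick brown fox jumps over the lazy dog."
sus = "A quick brown fox jumps over a sleeping dog."
print(candidate_score(sus, src))
```

Source documents scoring above some threshold would then proceed to the detailed document-comparison step.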

Some of the related works mentioned in this paper are:
  • The N-gram based method proposed by Efstathios Stamatatos
  • The fuzzy-token matching method based on string similarity join by Jiannan Wang, Guoliang Li & Jianhua Feng
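To give an idea of fuzzy token matching, here is a minimal sketch assuming `difflib`'s string-similarity ratio as a stand-in for the string similarity join used in the cited work; the function name and threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def fuzzy_token_overlap(tokens_a, tokens_b, threshold=0.8):
    # Fraction of tokens in A that fuzzily match some token in B,
    # so small spelling variations still count as matches
    matched = 0
    for a in tokens_a:
        if any(SequenceMatcher(None, a, b).ratio() >= threshold
               for b in tokens_b):
            matched += 1
    return matched / len(tokens_a) if tokens_a else 0.0

# "plagiarise"/"plagiarize" differ by one letter but still match
print(fuzzy_token_overlap(["plagiarise", "detection"],
                          ["plagiarize", "detect"]))
```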

Based on those works (basic methods), the proposed system performs as described in the following schematic diagram:


The input is a set of source documents and their corresponding suspicious documents.

After comparing the suspicious text with a source text, all matched N-grams are stored. Passages are formed in both the suspicious and source documents based on a passage boundary condition, which acts as a threshold value depending on the total word length of the text. The fuzzy-semantic method is then used to improve the similarity calculation.
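The passage formation step can be sketched as merging match positions into passages whenever the gap between consecutive matches stays below the boundary threshold. In the paper this threshold depends on the text's word length; here it is a fixed illustrative assumption, as is the function name.

```python
def form_passages(match_positions, max_gap=20):
    # Merge sorted match positions into (start, end) passages:
    # a new passage begins whenever the gap exceeds max_gap
    passages = []
    for pos in sorted(match_positions):
        if passages and pos - passages[-1][1] <= max_gap:
            passages[-1][1] = pos  # extend the current passage
        else:
            passages.append([pos, pos])  # start a new passage
    return [tuple(p) for p in passages]

# Matches at 3, 5, 9 merge into one passage; 60, 64 form another
print(form_passages([3, 5, 9, 60, 64]))  # → [(3, 9), (60, 64)]
```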

The output, using a partial dataset from the PAN 2012 corpus for testing, is divided into five categories according to the type of plagiarized passage: highly obfuscated, low obfuscated, simulated, no-obfuscation and non-plagiarized.

Finally, using the statistics obtained for those categories, the different NLP-based pre-processing methods used in this system are compared under the fuzzy-semantic and improved fuzzy-semantic similarity measures. The experimental results show that POSPIFS (the POS method integrated with the improved fuzzy-semantic similarity measure) outperforms the other methods in terms of both accuracy and efficiency.

Presentation: Dorra EL MEKKI

Link :   presentation_dorraElMekki