Saturday, December 8, 2018

Paper Review 10: Sentence Similarity Based on Semantic Vector Model

In this post, the article "Sentence Similarity Based on Semantic Vector Model" is summarized.

Written by:
  • Zhao Jingling, School of Computer, Beijing University of Posts and Telecommunications, Beijing, China
  • Zhang Huiyun, National Engineering Laboratory for Mobile Network Security, School of Computer, Beijing University of Posts and Telecommunications, Beijing, China
  • Cui Baojiang, National Engineering Laboratory for Mobile Network Security, School of Computer, Beijing University of Posts and Telecommunications, Beijing, China
Published in: 2014 Ninth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing




Summary:

According to the authors, a sentence is “considered to be a sequence of words each of which carries useful information … Words sequence contains semantic features, word order information and structural characteristics of sentence, which sentence similarity relies on.”

Previous methods for computing sentence similarity were designed for long text documents. This paper presents a new approach applicable to very short texts of sentence length, based on semantic information, structural information, and word order information.

In fact, this proposed approach is a combination of:

  • Semantic similarity between sentences, based on:
      • Word semantic similarity, which can be obtained using dictionary/thesaurus-based methods or corpus-based methods. HowNet, which defines a word in a complicated multidimensional knowledge description language, is the lexical knowledge base employed in this research. In HowNet, a word is a group of small units (sememes) that describe the word's meaning.
      • The structure of sentences.

    Each sentence is represented by a semantic vector calculated over the joint word set of the two sentences' word sets, with element values in the range [0, 1].

  • Word order similarity between sentences, which provides information about the relationships between words, since word order plays a role in conveying the meaning of sentences.


So, the overall sentence similarity is defined as:

Sim(S1, S2) = ε · Ss + (1 − ε) · So

where Ss is the semantic similarity, So is the word order similarity, and ε (epsilon) is a factor weighting the significance of semantic information against word order information, with an empirically found value of 0.85.
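To make this concrete, here is a minimal Python sketch of the combined measure, assuming a trivial exact-match word similarity in place of the paper's HowNet-based one; apart from the 0.85 weight, all helper names and values are illustrative, not from the paper.

```python
# Minimal sketch of the combined sentence similarity, assuming a
# simplified word_sim() in place of the paper's HowNet-based measure.
import math

EPSILON = 0.85  # empirically chosen weight reported in the paper

def word_sim(w1, w2):
    # Placeholder: the paper derives word similarity from HowNet sememes.
    return 1.0 if w1 == w2 else 0.0

def semantic_vector(sentence, joint_words):
    # Each entry is the best similarity between a joint word and any
    # word of the sentence, so values stay in [0, 1].
    return [max(word_sim(jw, w) for w in sentence) for jw in joint_words]

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def order_vector(sentence, joint_words):
    # 1-based position of each joint word in the sentence, 0 if absent.
    return [sentence.index(jw) + 1 if jw in sentence else 0 for jw in joint_words]

def sentence_similarity(s1, s2):
    t1, t2 = s1.lower().split(), s2.lower().split()
    joint = sorted(set(t1) | set(t2))
    ss = cosine(semantic_vector(t1, joint), semantic_vector(t2, joint))
    r1, r2 = order_vector(t1, joint), order_vector(t2, joint)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    norm = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    so = 1 - diff / norm if norm else 1.0
    return EPSILON * ss + (1 - EPSILON) * so

print(sentence_similarity("a quick brown fox", "a quick brown dog"))
```

With a HowNet-based measure plugged into word_sim, this would approximate the paper's formulation.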

Finally, this approach shows the best results compared to other methods, such as the method based on semantics and word order proposed by Li Yuhua, a method based on semantics alone, and a method based on word order alone. That is why we aim to use it in our project.

Wednesday, December 5, 2018

Paper Review 9: Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection

In this post, the article "Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection" is summarized.

Deepa Gupta, Dept. of Mathematics, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India
Vani K, Dept. of Computer Science, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India
Charan Kamal Singh, Dept. of Computer Science, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India


Summary:

Based on a personal experience here at Bilkent University, where I was accused of cheating on a homework assignment, I discovered that plagiarism is one of the most serious offenses in academia. While trying to understand that rule and how it works, I discovered that plagiarism detection is based on similarity, which is the main problem we are facing in the second part of our natural language processing term project. This is why I chose this published paper, which explains what plagiarism is and how it can be detected.
This paper investigates a number of different methods for external plagiarism detection, where suspicious documents are compared against a collection of source documents. In fact, semantics and linguistic variations play a very important role in that detection.
The general approach to external plagiarism detection includes the following steps:
  • pre-processing 
  • candidate document selection 
  • document comparisons 
  • passage boundary detection and evaluation

Some of the related works mentioned in this paper are:
  • The N-gram based method proposed by Efstathios Stamatatos
  • The fuzzy-token matching method based on string similarity joins, proposed by Jiannan Wang, Guoliang Li & Jianhua Feng

Based on those works (basic methods), the proposed system performs as described in the following schematic diagram:


The input is a set of source documents and their corresponding suspicious documents.

After comparing the suspicious text with a source text, all matched N-grams are stored. Passages are then formed in both the suspicious and source documents based on a passage boundary condition, which acts as a threshold value that depends on the total word length of the text. Finally, the fuzzy-semantic method is used to improve the similarity calculation.
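As an illustration of the matching and passage-forming steps, here is a minimal Python sketch; the 4-gram size and the gap threshold are placeholder assumptions, not values from the paper.

```python
# A minimal sketch of the n-gram matching and passage-forming step,
# assuming word 4-grams; max_gap stands in for the paper's
# length-dependent passage boundary condition.
def ngrams(tokens, n=4):
    # All n-grams of a token list, paired with their start positions.
    return [(tuple(tokens[i:i + n]), i) for i in range(len(tokens) - n + 1)]

def matched_positions(suspicious, source, n=4):
    # Start positions in the suspicious text whose n-grams also occur
    # somewhere in the source text.
    src = {g for g, _ in ngrams(source.split(), n)}
    return sorted(i for g, i in ngrams(suspicious.split(), n) if g in src)

def passages(positions, max_gap=50):
    # Group nearby matches into candidate plagiarized passages.
    groups, current = [], []
    for p in positions:
        if current and p - current[-1] > max_gap:
            groups.append((current[0], current[-1]))
            current = []
        current.append(p)
    if current:
        groups.append((current[0], current[-1]))
    return groups

src = "the quick brown fox jumps over the lazy dog near the river bank"
sus = "yesterday the quick brown fox jumps over the fence happily"
print(passages(matched_positions(sus, src)))  # [(1, 4)]
```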

For testing, a partial dataset from the PAN 2012 corpus is used, and the output is divided into five categories according to the type of plagiarized passage: high obfuscation, low obfuscation, simulated, no obfuscation, and non-plagiarized.

Finally, using the data statistics from those categories, the different NLP-based pre-processing methods used in this system are compared under the fuzzy-semantic and improved fuzzy-semantic similarity measures. The experimental results show that POSPIFS (the POS method integrated with the improved fuzzy-semantic similarity measure) performs better than the other methods in terms of accuracy and efficiency.

Thursday, November 29, 2018

Paper Review 8: Text coherence new method using word2vec sentence vectors and most likely n-grams



In this post, the paper "Text coherence new method using word2vec sentence vectors and most likely n-grams" is summarized.

Link to paper: https://ieeexplore.ieee.org/document/8311598  


Mohamad Abdolahi Kharazmi, Morteza Zahedi Kharazmi, "Text coherence new method using word2vec sentence vectors and most likely n-grams", 3rd Iranian Conference on Intelligent Systems and Signal Processing (ICSPIS), Iran, 20-21 Dec. 2017. IEEE Xplore.

Summary:

This paper investigates the automatic evaluation of text coherence, which is a fundamental post-processing step in many NLP tasks such as machine translation and question answering.

The approach proposed in this paper combines word2vec vectors and most likely n-grams to assess the coherence and topic integrity of a document. It uses newer techniques such as deep learning, transforming words into numerical vectors, and statistical methods to assess the coherence of texts.

The authors evaluate text coherence with statistical methods covering both local and global coherence, which capture text organization at the level of sentence-to-sentence and paragraph-to-paragraph transitions, without relying on the meaning of individual words or on handcrafted rules. As a result, the approach does not depend on a particular language or its semantic concepts, and it can be applied to any language.

The preprocessing here differs from that of other methods.

First, each document is split into separate sentences. Second, a sentence matrix is created using word2vec word vectors. Finally, it is normalized using an n-gram model.
Stop-word removal, stemming, and POS tagging are not performed. However, some basic preprocessing is applied, such as removing spacing between words and punctuation marks, removing extra space characters between words, and unifying accented characters.
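For illustration, here is a minimal sketch of the sentence-vector step using gensim's word2vec (assuming gensim 4.x); the toy corpus, the dimensions, and the row-averaging shortcut are my assumptions, and the paper's n-gram normalization is not shown.

```python
# Minimal sketch of building sentence matrices from word2vec vectors,
# assuming gensim 4.x; corpus, dimensions, and averaging are toy choices.
import numpy as np
from gensim.models import Word2Vec

docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]  # toy corpus
model = Word2Vec(docs, vector_size=50, min_count=1, epochs=20)

def sentence_matrix(sentence):
    # Stack the word2vec vector of each known word into a sentence matrix.
    return np.stack([model.wv[w] for w in sentence if w in model.wv])

def sentence_vector(sentence):
    # One common simplification: average the rows of the sentence matrix.
    return sentence_matrix(sentence).mean(axis=0)

print(sentence_vector(["the", "cat", "sat"]).shape)  # (50,)
```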
Not only is the preprocessing different, but in previous approaches local coherence is tested at the level of several consecutive sentences, so sections that are far apart may show no relation at all. In the proposed approach, local coherence is instead raised to the level of a paragraph, and a coherent paragraph is taken to be a locally coherent section.

To conclude, this model performs well: it is robust across languages and domains, and it does not suffer from high computational complexity.

Thursday, November 22, 2018

Paper Review 7: A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences

In this post, the research article "A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences" is summarized.
Ming Che Lee: Department of Computer and Communication Engineering, Ming Chuan University, Taiwan.
Jia Wei Chang: Department of Engineering Science, National Cheng Kung University, Taiwan.
Tung Cheng Hsieh: Department of Visual Communication Design, Hsuan Chuang University, Taiwan.
Published 10 April 2014

Summary:


Since part of our project is about finding similarity between terms (sentences) and documents (texts; job descriptions), we chose this published paper, which discusses almost all the known techniques and approaches, from the early days of natural language processing up to the writing of this article.

This paper first presents the different existing models, such as Latent Semantic Analysis/Indexing (LSA/LSI), Hyperspace Analogue to Language (HAL), Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI), and the Vector Space Model (VSM), with a brief comparison. The conclusion is that those approaches calculate similarity based on the number of shared terms in articles while overlooking the syntactic structure of sentences, so some disadvantages may arise when applying them directly to calculate the similarity between short texts or sentences. The paper then presents a new approach: a grammar- and semantic-corpus-based similarity algorithm for natural language sentences.
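To make that limitation concrete, here is a small sketch of a purely term-overlap measure (TF-IDF cosine similarity via scikit-learn, my choice of tooling rather than the paper's): a paraphrase pair with few shared terms gets a low score despite having the same meaning.

```python
# Sketch of the term-overlap limitation using TF-IDF cosine similarity;
# scikit-learn is used here for illustration, not by the paper itself.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = "the doctor healed the patient"
s2 = "the physician cured the sick person"  # paraphrase with few shared terms

tfidf = TfidfVectorizer().fit_transform([s1, s2])
print(cosine_similarity(tfidf[0], tfidf[1]))  # low score despite same meaning
```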

This new approach addresses the limitations of these existing approaches by using grammatical rules and the WordNet ontology.

Traditional information retrieval technologies may not always determine a perfect match when there is no obvious relation or concept overlap between two natural language sentences. Some approaches deal with this problem by determining the order of words and evaluating semantic vectors; however, they are hard to apply to sentences with complex syntax, as well as to long sentences and sentences with arbitrary patterns and grammars.

The proposed approach takes advantage of a corpus-based ontology and grammatical rules to overcome this problem: a set of grammar matrices is built to represent the relationships (correlations) between pairs of sentences, instead of considering common words or word order. The size of the set is limited to the maximum number of selected grammar links. The latent semantics of words is calculated via a WordNet similarity measure over semantic trees, which increases the chances of finding a semantic relation between any nouns and verbs.
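As a rough illustration of a WordNet-based word similarity, here is a sketch using NLTK's path similarity; this is a stand-in of my choosing, not necessarily the tree-based measure used in the paper.

```python
# Minimal sketch of a WordNet-based word similarity via NLTK's
# path_similarity; requires nltk.download('wordnet') beforehand.
from nltk.corpus import wordnet as wn

def wordnet_sim(word1, word2, pos=wn.NOUN):
    # Best path similarity over all synset pairs of the two words.
    pairs = [(s1, s2) for s1 in wn.synsets(word1, pos)
                      for s2 in wn.synsets(word2, pos)]
    scores = [s1.path_similarity(s2) for s1, s2 in pairs]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0)

print(wordnet_sim("car", "automobile"))  # 1.0: shared synset
print(wordnet_sim("car", "dog"))         # much lower
```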

Finally, the results demonstrate that the proposed method performs very well, both on sentence similarity and on the task of paraphrase recognition.

Sunday, November 11, 2018

Paper Review 6: PhraseCTM: Correlated Topic Modeling on Phrases within Markov Random Fields

In this post, the paper "PhraseCTM: Correlated Topic Modeling on Phrases within Markov Random Fields" is summarized.


Weijing Huang, Tengjiao Wang, Wei Chen, Siyuan Jiang, Kam-Fai Wong, "PhraseCTM: Correlated Topic Modeling on Phrases within Markov Random Fields", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 521–526, Melbourne, Australia, July 15-20, 2018. Association for Computational Linguistics.


Summary:

PhraseCTM is a novel method proposed to discover correlated topics at the phrase level.

This work is done in two stages:
  • A training stage, which trains the model using Markov Random Fields to link phrases and words when they are semantically coherent.
  • Generating the correlation of topics from PhraseCTM and evaluating the model through a quantitative experiment and a human study.


What problem does CTM solve?

It is easier to understand a topic from topic modeling on phrases ("grounding conductor, grounding wire, aluminum wiring") than from topic modeling on words ("ground, wire, use, power, cable, wires"), which lacks context. However, the first method becomes complex when the corpus size is large.
So, CTM applies a correlation structure to figure out the correlated relationships between topics and group similar topics together.

When does CTM perform well?

Phrases are much less frequent than words in each document, and CTM needs more contextual information to build a well-performing model, so it is not used with short documents.

How does CTM work?

1. Training PhraseCTM:
Transform a document into words and semantically coherent phrases.
Link phrases and their component words:
  •    Calculate the NPMI metric (a sketch follows this list).
  •    Define a threshold on it to take the linking decision.
  •    Double-count the phrases as two parts: one as the phrase itself, the other as the component words.
  •    Model the generation of words and phrases simultaneously by linking phrases and component words within a Markov Random Field.
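Below is a minimal sketch of the NPMI computation and the threshold decision mentioned above; the counts and the 0.3 threshold are placeholders, not values from the paper.

```python
# Sketch of the NPMI link criterion, assuming simple corpus counts;
# the 0.3 threshold is a placeholder, not the paper's value.
import math

def npmi(p_xy, p_x, p_y):
    # Normalized pointwise mutual information, in [-1, 1].
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def link_phrase(phrase_count, word_counts, total):
    # Link a two-word phrase to its component words if NPMI is high enough.
    p_xy = phrase_count / total
    p_x, p_y = word_counts[0] / total, word_counts[1] / total
    return npmi(p_xy, p_x, p_y) > 0.3  # placeholder threshold

print(link_phrase(phrase_count=80, word_counts=(100, 120), total=10_000))  # True
```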

2. Generating the correlation of topics:
The method is evaluated on five datasets through a quantitative experiment and a human study.

Conclusion:
PhraseCTM helps discover the topics of a corpus, and it has demonstrated high-quality phrase-level topics.


  

Thursday, November 8, 2018

Paper Review 5: REVIEW ON NATURAL LANGUAGE PROCESSING

In this post, the paper "REVIEW ON NATURAL LANGUAGE PROCESSING" is summarized.

Prof. Alpa Reshamwala, Prajakta Pawar, Prof. Dhirendra Mishra, IRACST – Engineering Science and Technology: An International Journal (ESTIJ), ISSN: 2250-3498, Vol. 3, No. 1, February 2013.

Link to paper: REVIEW ON NATURAL LANGUAGE PROCESSING

Summary:
Natural language processing is a branch of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. Natural languages are the languages spoken by humans. Whatever the form of the communication, natural languages are used to express our knowledge and emotions and to convey our responses to other people and to our surroundings. Research in NLP, a very active area of research and development, has been increasingly addressed in recent years.

LEVELS OF NLP 

The most explanatory method for presenting what actually happens within an NLP system is by means of the 'levels of language' approach. Psycholinguistic research suggests that language processing is much more dynamic, as the levels can interact in a variety of orders. For example, the pragmatic knowledge that the document you are reading is about biology will be used when a particular word that has several possible senses is encountered, and the word will be interpreted as having the biology sense.

    A. Phonology: deals with the interpretation of speech sounds within and across words.
    B. Morphology: looks at the ways in which words break down into their components and how that affects their grammatical status.
    C. Semantics: builds up a representation of the objects and actions that a sentence is describing and includes the details provided by adjectives, adverbs, and prepositions.
    D. Pragmatics: is "the analysis of the real meaning of an utterance in a human language, by disambiguating and contextualizing the utterance".

METHODS AND APPROACHES

A. Natural Language Processing for Speech Synthesis:

TTS synthesis makes extensive use of NLP techniques, since text data is first input into the system and thus must be processed in the first place.

B. Natural Language Processing for Speech Recognition:

Automatic Speech Recognition systems make use of NLP techniques in a fairly restricted way: they are based on grammars. (This paper refers to a grammar as a set of rules that determine the structure of texts written in a given language by defining its morphology and syntax.)
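To illustrate what such a grammar looks like, here is a tiny sketch using NLTK's context-free grammar tools (my choice of library, not something the paper prescribes):

```python
# A toy context-free grammar of the kind such systems build on;
# NLTK is used here for illustration only (pip install nltk).
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'ball'
V -> 'chased' | 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased a ball".split()):
    print(tree)  # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det a) (N ball))))
```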


While NLP is a relatively recent area of research and application compared to other information technology approaches, there have been sufficient successes to date to suggest that NLP-based information access technologies will continue to be a major area of research and development in information systems now and far into the future.

Thursday, November 1, 2018

Paper Review 4: SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags

In this post, the paper "SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags" is summarized.


Eva Vanmassenhove, Andy Way, "SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags", Proceedings of ACL 2018, Student Research Workshop, pages 67–73, Melbourne, Australia, July 15-20, 2018. Association for Computational Linguistics.


Summary:

Neural Machine Translation (NMT) systems learn by generalizing patterns from raw sentences; providing semantic supersense tags and syntactic supertags together helps an NMT system learn multi-word expressions.

This research enhances word embeddings in NMT systems by providing a combination of semantic supersenses (SST) and syntactic supertags (CCG supertags).

CCG supertagging labels every word in a sentence with its syntactic role (verb/noun/auxiliary) and can therefore resolve ambiguity such as prepositional attachment.

SST, on the other hand, classifies words independently into coarse classes such as modal, adverb, or noun.

These two new embedding vectors are then concatenated with the classical word-embedding vector and used in the model.
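As a rough sketch of this concatenation, assuming toy vocabulary sizes and embedding dimensions (in the actual paper the embeddings are learned jointly inside the NMT encoder, not drawn at random):

```python
# Sketch of concatenating word, supersense (SST), and supertag (CCG)
# embeddings; all sizes and the random tables are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
word_emb = rng.normal(size=(1000, 256))  # word vocabulary x dim
sst_emb = rng.normal(size=(30, 32))      # supersense tag set x dim
ccg_emb = rng.normal(size=(500, 64))     # CCG supertag set x dim

def embed(word_id, sst_id, ccg_id):
    # One input vector per token: word, supersense, and supertag parts.
    return np.concatenate([word_emb[word_id], sst_emb[sst_id], ccg_emb[ccg_id]])

print(embed(5, 2, 17).shape)  # (352,) = 256 + 32 + 64
```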

Explicitly adding this level of semantics provides the translation system with a higher level of abstraction that is beneficial for learning more complex constructions.

Three NMT systems are trained and tested on 1M sentences of the Europarl corpus for EN–FR and EN–DE: one based on supersenses (SST), one on syntactic supertags (CCG), and one on both (SST–CCG).

Results for the EN–DE system show that the SST system converges faster, as hypothesized; its learning curve is also more consistent.

However, focusing on the later stages of the learning process, the SST–CCG model comes out best: its translations are 5% better compared to the two other systems.

The results for the EN–DE and EN–FR systems are very similar and lead to the same conclusion: combining semantic and syntactic features is beneficial for generalization, which leads to better translation.

Presentation: Dorra EL MEKKI

Link :   presentation_dorraElMekki