Friday, 14 December 2018
Saturday, 8 December 2018
Paper Review 10 : Sentence Similarity Based on Semantic Vector Model
In this post, the article "Sentence Similarity Based on Semantic Vector Model" is summarized.
Written by :
- Zhao Jingling, School of Computer, Beijing University of Posts and Telecommunications Beijing, China
- Zhang Huiyun National Engineering Laboratory for Mobile Network Security, School of Computer, Beijing University of Posts and Telecommunications Beijing, China
- Cui Baojiang National Engineering Laboratory for Mobile Network Security, School of Computer Beijing University of Posts and Telecommunications Beijing, China
Published in : 2014 Ninth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing
Link to paper : Sentence Similarity Based on Semantic Vector Model
Summary:
According to the authors, a sentence is “considered to be a sequence of words each of which carries useful information … Words sequence contains semantic features, word order information and structural characteristics of sentence, which sentence similarity relies on.”
Previous methods for computing sentence similarity were designed for long text documents. This paper presents a new approach applicable to very short texts of sentence length, based on semantic information, structural information, and word order information.
In fact, the proposed approach is a combination of:
- Semantic similarity between sentences, based on:
  - Word semantic similarity: this can be obtained using dictionary/thesaurus-based methods or corpus-based methods. HowNet, which defines a word in a complicated multidimensional knowledge description language, is the lexical knowledge base employed in this research. In HowNet, a word is a group of small units (sememes) that describe the word's meaning.
  - The structure of sentences: each sentence is represented by a semantic vector calculated over the joint word set of the two sentences' word sets, in which element values lie in the range [0,1].
- Word order similarity between sentences: this provides information about the relationships between words, since word order plays a role in conveying the meaning of a sentence.
So, the overall sentence similarity is defined as:

Sim(S1, S2) = ε · Ss + (1 − ε) · Sr

where Ss is the semantic similarity, Sr is the word order similarity, and ε is a factor weighting the significance of semantic information against word order information; its value, 0.85, was found empirically.
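This weighted combination can be sketched in a few lines (a minimal illustration, assuming cosine similarity for the semantic vectors and a normalized-difference measure for the word order vectors, in the style of Li et al.; the function names are ours, not the paper's):

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def word_order_similarity(r1, r2):
    # r1, r2: word-order vectors built from the positions of the
    # joint word set's words in each sentence
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    summ = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / summ if summ else 1.0

def sentence_similarity(s1_vec, s2_vec, r1, r2, eps=0.85):
    # overall similarity: eps weights the semantic part against
    # the word-order part (eps = 0.85 per the paper)
    return eps * cosine(s1_vec, s2_vec) + (1 - eps) * word_order_similarity(r1, r2)
```

Identical semantic and word-order vectors yield a similarity of 1.0, and the 0.85 weight makes the semantic component dominate, as the paper intends.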
Finally, this approach achieves the best results compared to other methods, such as the semantic-and-word-order method proposed by Li Yuhua, a purely semantic method, and a purely word-order method. That is why we aim to use it in our project.
Wednesday, 5 December 2018
Paper Review 9 : Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection
In this post, the article "Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection" is summarized.
Deepa Gupta, Dept. of Mathematics, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India
Vani K, Dept. of Computer Science, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India
Charan Kamal Singh, Dept. of Computer Science, ASE Bangalore, Amrita Vishwa Vidyapeetham, Bangalore, India
Published in : 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI)
Link to paper : Using Natural Language Processing Techniques and Fuzzy-Semantic Similarity for Automatic External Plagiarism Detection
Summary:
Based on a personal experience here at Bilkent University, where I was accused of cheating on a homework assignment, I discovered that plagiarism is one of the most serious offenses in academia. While trying to understand this rule and how it works, I learned that plagiarism detection is based on similarity, which is the main problem we are facing in the second part of our natural language processing term project. This is why I chose this paper, which explains what plagiarism is and how it can be detected.
This paper investigates a number of different methods for external plagiarism detection, where suspicious documents are compared against a collection of source documents. Semantics and linguistic variations play a very important role in that detection.
The general approach, dealing with external plagiarism, includes the following steps:
- pre-processing
- candidate document selection
- document comparisons
- passage boundary detection and evaluation
Some of the related works mentioned in this paper are:
- An N-gram based method proposed by Efstathios Stamatatos
- A fuzzy-token matching method based on string similarity join, by Jiannan Wang, Guoliang Li & Jianhua Feng
Building on these basic methods, the proposed system works as follows (the paper includes a schematic diagram of the pipeline):
The input is a set of source documents and their corresponding suspicious documents.
After comparing a suspicious text with a source text, all matched N-grams are stored. Passages are formed in both the suspicious and the source document based on a passage-boundary condition that acts as a threshold; this value depends on the total word length of the text. The fuzzy-semantic method is then used to refine the similarity calculation.
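The N-gram matching step can be pictured as follows (a toy sketch using word trigrams and plain whitespace tokenization; the helper names are ours, and the paper's actual matching is more elaborate):

```python
def ngrams(tokens, n=3):
    # produce the set of word n-grams for a token list
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(suspicious, source, n=3):
    # fraction of the suspicious text's n-grams that also occur
    # in the source text; 1.0 means every n-gram matched
    s = ngrams(suspicious.split(), n)
    t = ngrams(source.split(), n)
    return len(s & t) / len(s) if s else 0.0
```

A high overlap score would flag a suspicious passage for the subsequent fuzzy-semantic comparison.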
The output, using a partial dataset from the PAN 2012 corpus for testing, is divided into five categories according to the type of plagiarized passage: Highly, Low, Simulated, No-obfuscation, and Non-plagiarized.
Finally, using the statistics from these categories, the different NLP-based pre-processing methods used in this system are compared under the fuzzy-semantic and improved fuzzy-semantic similarity measures. The experimental results show that POSPIFS (the POS method integrated with the improved fuzzy-semantic similarity measure) outperforms the other methods in both accuracy and efficiency.
Thursday, 29 November 2018
Paper Review 8 : Text coherence new method using word2vec sentence vectors and most likely n-grams
In this post, the paper "Text coherence new method using word2vec sentence vectors and most likely n-grams" is summarized.
Link to paper: https://ieeexplore.ieee.org/document/8311598
Mohamad Abdolahi, Morteza Zahedi, "Text coherence new method using word2vec sentence vectors and most likely n-grams", 3rd Iranian Conference on Intelligent Systems and Signal Processing (ICSPIS), Iran, 20-21 Dec. 2017. IEEE Xplore.
Summary:
This paper investigates the automatic evaluation of text coherence, which is a fundamental post-processing step in many NLP tasks such as machine translation and question answering.
The approach proposed in this paper combines word2vec vectors and most-likely n-grams to assess the coherence and topic integrity of a document. It uses recent techniques such as deep learning, transforming words into numerical vectors, and statistical methods to assess the coherence of texts.
The authors evaluate text coherence with statistical methods covering both local and global coherence, which capture text organization at the level of sentence-to-sentence and paragraph-to-paragraph transitions, without relying on the meaning of words or on handcrafted rules. The approach therefore does not depend on the language or its semantic concepts and can be applied to any language.
Unlike other methods, the preprocessing here is different. First, each document is split into separate sentences. Second, a matrix is created for each sentence using word2vec word vectors. Finally, the matrix is normalized using an n-gram model. Stop-word removal, stemming, and POS tagging are not performed.
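The sentence-matrix step can be pictured as follows (a toy lookup table stands in for a trained word2vec model; the vocabulary, dimensions, and function name are illustrative only):

```python
import numpy as np

# toy word-vector lookup standing in for a trained word2vec model
word_vectors = {
    "the": np.array([0.1, 0.3]),
    "cat": np.array([0.7, 0.2]),
    "sat": np.array([0.4, 0.9]),
}

def sentence_matrix(sentence, dim=2):
    # stack each word's vector into one row of the sentence matrix;
    # out-of-vocabulary words map to a zero vector
    rows = [word_vectors.get(w, np.zeros(dim)) for w in sentence.lower().split()]
    return np.vstack(rows)

m = sentence_matrix("The cat sat")
# m has one row per word and one column per embedding dimension
```

Each document then becomes a sequence of such matrices, which the statistical (n-gram) stage operates on.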
However, some basic preprocessing steps are applied, such as removing spacing between words and punctuation marks, removing extra space characters between words, and unifying accented characters.
Not only is the preprocessing different, but previous approaches also test local coherence at the level of several consecutive sentences, so sections separated by a large distance may not show any relation. In the proposed approach, local coherence is instead raised to the level of a paragraph, and a coherent paragraph is taken to be a locally coherent section.
To conclude, this model is very effective: it is robust across languages and domains, and it does not suffer from high computational complexity.
Thursday, 22 November 2018
Paper Review 7 : A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences
In this post, the research Article "A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences" is summarized.
Ming Che Lee: Department of Computer and Communication Engineering, Ming Chuan University, Taiwan.
Jia Wei Chang: Department of Engineering Science, National Cheng Kung University, Taiwan.
Tung Cheng Hsieh: Department of Visual Communication Design, Hsuan Chuang University, Taiwan.
Published 10 April 2014
Summary:
Since part of our project is about finding similarity between terms (sentences) and documents (texts; job descriptions), we chose this paper, which discusses almost all the techniques and approaches known, from the day computer scientists started natural language processing to the day this article was written.
This paper first presents the different existing models, such as Latent Semantic Analysis/Indexing (LSA/LSI), Hyperspace Analogue to Language (HAL), Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI), and the Vector Space Model (VSM), with a brief comparison. The conclusion is that these approaches calculate similarity based on the number of shared terms in articles while overlooking the syntactic structure of sentences, and some disadvantages may arise when applying them directly to calculate the similarity between short texts or sentences. The paper then presents a new approach: a grammar- and semantic-corpus-based similarity algorithm for natural language sentences.
This new approach addresses the limitations of the existing ones by using grammatical rules and the WordNet ontology.
Traditional information retrieval technologies may not always determine a perfect match when there is no obvious relation or concept overlap between two natural language sentences. Some approaches address this problem by determining the order of words and evaluating semantic vectors; however, they are hard to apply to sentences with complex syntax, as well as to long sentences and sentences with arbitrary patterns and grammars.
The proposed approach takes advantage of a corpus-based ontology and grammatical rules to overcome this problem: a set of grammar matrices is built to represent the relationships (correlations) between pairs of sentences, instead of considering common words or word order. The size of the set is limited to the maximum number of selected grammar links. The latent semantics of words are computed via a WordNet similarity measure over semantic trees, which increases the chances of finding a semantic relation between any nouns and verbs.
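A WordNet-style path measure relates two words through their lowest common ancestor in an is-a hierarchy. The toy taxonomy below merely illustrates the idea (it is not the actual WordNet data, and the paper's measure over semantic trees is more elaborate):

```python
# toy is-a taxonomy standing in for WordNet's hypernym hierarchy
parents = {
    "dog": "canine", "canine": "carnivore", "carnivore": "mammal",
    "cat": "feline", "feline": "carnivore", "mammal": "animal",
}

def ancestors(word):
    # chain of ancestors from the word up to the taxonomy root
    path = [word]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def path_similarity(w1, w2):
    # 1 / (1 + number of edges between w1 and w2 through their
    # lowest common ancestor), as in WordNet's path measure
    p1, p2 = ancestors(w1), ancestors(w2)
    common = next(a for a in p1 if a in p2)
    dist = p1.index(common) + p2.index(common)
    return 1.0 / (1.0 + dist)
```

Here "dog" and "cat" meet at "carnivore", two edges away on each side, giving a similarity of 1/5; identical words score 1.0.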
Finally, the results demonstrate that the proposed method performs very well both on sentence similarity and on the task of paraphrase recognition.
Sunday, 11 November 2018
Paper Review 6 :PhraseCTM: Correlated Topic Modeling on Phrases within Markov Random Fields
In this post, the paper "PhraseCTM: Correlated Topic Modeling on Phrases within Markov Random Fields" is summarized.
Weijing Huang, Tengjiao Wang, Wei Chen, Siyuan Jiang, Kam-Fai Wong, "PhraseCTM: Correlated Topic Modeling on Phrases within Markov Random Fields", Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 521-526, Melbourne, Australia, July 15-20, 2018. Association for Computational Linguistics.
Summary:
PhraseCTM is a novel method proposed to discover correlated topics at the phrase level.
This work is done in two stages:
- A training stage that builds the model, using Markov Random Fields to link phrases and words when they are semantically coherent.
- Generating the correlation of topics from PhraseCTM and evaluating the model through a quantitative experiment and a human study.
What is CTM a solution for?
It is easier to understand a topic from topic modeling on phrases ("grounding conductor, grounding wire, aluminum wiring") than from topic modeling on words ("ground, wire, use, power, cable, wires"), which does not include the context. However, the phrase-based method is complex when the vocabulary size is large.
So, CTM applies a correlation structure in order to figure out the correlated relationships between topics and group similar topics together.
When does CTM perform well?
Phrases are much less frequent than words in each document, and CTM needs enough contextual information to build a well-performing model, so it should not be used with short documents.
How does CTM work?
1. Training PhraseCTM:
- Transform a document into semantically coherent words and phrases.
- Link phrases to their component words:
  - Calculate the NPMI metric.
  - Define a threshold for the linking decision.
  - Double-count the phrases as two parts: one as the phrase itself, the other as the component words.
  - Model the generation of words and phrases simultaneously by linking the phrases and component words within a Markov Random Field.
2. Generating the correlation of topics:
- Evaluate the method on five datasets through a quantitative experiment and a human study.
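The NPMI metric used in the phrase-word linking step can be written out directly from its standard definition (a sketch; estimating the probabilities from corpus counts, which the paper does, is not shown):

```python
import math

def npmi(p_xy, p_x, p_y):
    # normalized pointwise mutual information:
    # NPMI(x, y) = log(p(x,y) / (p(x) * p(y))) / (-log p(x,y))
    # It ranges from -1 (never co-occur) to +1 (always co-occur),
    # which makes it convenient to compare against a fixed threshold.
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / (-math.log(p_xy))
```

A phrase and a component word whose NPMI exceeds the chosen threshold are treated as semantically coherent and linked in the Markov Random Field.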
Conclusion:
CTM is a solution that helps to find the topics of a corpus, and it has been shown to produce high-quality phrase-level topics.
Thursday, 8 November 2018
Paper Review 5: REVIEW ON NATURAL LANGUAGE PROCESSING
In this post, the paper "REVIEW ON NATURAL LANGUAGE PROCESSING" is summarized.
Prof. Alpa Reshamwala, Prajakta Pawar, Prof. Dhirendra Mishra
IRACST – Engineering Science and Technology: An International Journal (ESTIJ), ISSN: 2250-3498, Vol. 3, No. 1, February 2013
Link to paper : REVIEW ON NATURAL LANGUAGE PROCESSING
Summary:
Natural Language Processing is a branch of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. Natural languages are the languages spoken by humans. Whatever the form of the communication, natural languages are used to express our knowledge and emotions and to convey our responses to other people and to our surroundings. Research in NLP, which has become a very active area of research and development, has been increasingly addressed in recent years.
LEVELS OF NLP
The most explanatory method for presenting what actually happens within an NLP system is the ‘levels of language’ approach. Psycholinguistic research suggests that language processing is much more dynamic than a strict sequence, as the levels can interact in a variety of orders. For example, the pragmatic knowledge that the document you are reading is about biology will be used when a word with several possible senses is encountered, and the word will be interpreted as having its biology sense.
A. Phonology : deals with the interpretation of speech sounds within and across words.
B. Morphology : looks at the ways in which words break down into their components and how that affects their grammatical status.
C. Semantics : builds up a representation of the objects and actions that a sentence describes, including the details provided by adjectives, adverbs, and prepositions.
D. Pragmatics : is “the analysis of the real meaning of an utterance in a human language, by disambiguating and contextualizing the utterance”.
METHODS AND APPROACHES
A. Natural Language Processing for Speech Synthesis:
TTS synthesis makes extensive use of NLP techniques, since text is the system's input and must therefore be processed before anything else.
B. Natural Language Processing for Speech Recognition:
Automatic Speech Recognition systems make use of NLP techniques in a fairly restricted way: they are based on grammars. (This paper refers to a grammar as a set of rules that determine the structure of texts written in a given language by defining its morphology and syntax.)
While NLP is a relatively recent area of research and application, as compared to other information technology approaches, there have been sufficient successes to date that suggest that NLP-based information access technologies will continue to be a major area of research and development in information systems now and far into the future.
Thursday, 1 November 2018
Paper Review 4 :SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags
In this post, the paper "SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags" is summarized.
Link to paper: http://aclweb.org/anthology/P18-3010?fbclid=IwAR2078q3nTRhoguSu36IHBxYRRmLmoNbZfC50Ruz3_2Ah3Id335n-FE2-Qw
Eva Vanmassenhove, Andy Way, "SuperNMT: Neural Machine Translation with Semantic Supersenses and Syntactic Supertags", Proceedings of ACL 2018, Student Research Workshop, pages 67-73, Melbourne, Australia, July 15-20, 2018. Association for Computational Linguistics.
Summary:
Neural Machine Translation (NMT) learns by generalizing patterns from raw sentences; providing semantic supersense tags and syntactic supertags together helps an NMT system learn multi-word expressions.
This research enhances the word embeddings in NMT systems by providing a combination of semantic supersenses (SSTs) and syntactic supertags (CCG supertags).
CCG supertags label every word in a sentence with its syntactic role (verb/noun/auxiliary) and can therefore resolve ambiguity in terms of prepositional attachment.
SSTs, on the other hand, classify words independently into their part of speech (modal/adverb/noun).
These two new embedding vectors are then concatenated with the classical embedding vector and used in the model.
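The concatenation step is straightforward to sketch (the dimensions below are illustrative placeholders, not the paper's actual embedding sizes):

```python
import numpy as np

def combine_embeddings(word_emb, sst_emb, ccg_emb):
    # concatenate the classical word embedding with the supersense-tag
    # and CCG-supertag embeddings into a single encoder input vector
    return np.concatenate([word_emb, sst_emb, ccg_emb])

# toy dimensions: a 512-d word embedding plus two 10-d tag embeddings
v = combine_embeddings(np.zeros(512), np.zeros(10), np.zeros(10))
# v is a 522-d vector fed to the NMT encoder in place of the word embedding
```

The encoder then sees one longer vector per token, so the tag information travels through the network alongside the lexical information.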
Explicitly adding this level of semantics provides the translation system with a higher level of abstraction, which is beneficial for learning more complex constructions.
Three NMT systems are trained and tested on 1M sentences of the Europarl corpus for EN-FR and EN-DE: one based on supersenses (SST), one on syntactic supertags (CCG), and one on both (SST-CCG).
Results for the EN-DE system show that the SST system converges faster, as hypothesized, and its learning curve is also more consistent.
But focusing on the later stages of the learning process, the CCG-SST model performs best: its translations are 5% better than those of the two other systems.
The results for the EN-DE and EN-FR systems are very similar and lead to the same conclusion: combining semantic and syntactic features is beneficial for generalization, which leads to better translation.