In this post, the research article "A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences" is summarized.
Published 10 April 2014
Summary:
Since part of our project involves finding similarity between terms (sentences) and documents (texts; job descriptions), we chose this paper, which discusses almost all the techniques and approaches known from the early days of natural language processing up to the time the article was written.
The paper first presents the main existing models, such as Latent Semantic Analysis/Indexing (LSA/LSI), Hyperspace Analogue to Language (HAL), Probabilistic Latent Semantic Analysis/Indexing (PLSA/PLSI), and the Vector Space Model (VSM), with a brief comparison. The conclusion is that these approaches calculate similarity based on the number of terms shared between documents while overlooking the syntactic structure of sentences, so several disadvantages arise when they are applied directly to short texts or sentences. The paper then presents a new approach: a grammar- and semantic-corpus-based similarity algorithm for natural language sentences.
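The shared-term limitation of those models can be illustrated with a minimal bag-of-words cosine similarity (a generic VSM sketch, not code from the paper): paraphrases with no common words score zero, and reordering words changes nothing.

```python
import math
from collections import Counter

def cosine_similarity(s1, s2):
    """Bag-of-words cosine similarity, as in a basic vector space model."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    vocab = set(v1) | set(v2)
    dot = sum(v1[w] * v2[w] for w in vocab)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Paraphrases with no shared words score zero under a term-overlap model.
print(cosine_similarity("the kids play outside", "children frolic outdoors"))  # 0.0
# Word order is ignored entirely, so these opposite sentences score 1.0.
print(cosine_similarity("the dog bit the man", "the man bit the dog"))  # 1.0
```

This is exactly the failure mode the paper targets: for short sentences there are too few terms for overlap counts to be reliable, and syntax is discarded.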
This new approach addresses the limitations of the existing approaches by using grammatical rules and the WordNet ontology.
Traditional information retrieval techniques may fail to find a good match when there is no obvious relation or concept overlap between two natural language sentences. Some approaches address this problem by considering word order and evaluating semantic vectors; however, they are hard to apply to sentences with complex syntax, as well as to long sentences and sentences with arbitrary patterns and grammars.
The proposed approach takes advantage of a corpus-based ontology and grammatical rules to overcome this problem: a set of grammar matrices is built to represent the relationships (correlations) between pairs of sentences, instead of relying on common words or word order. The size of this set is limited by the maximum number of selected grammar links. The latent semantics of words is computed via a WordNet similarity measure over semantic trees, which increases the chance of finding a semantic relation between any nouns and verbs.
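A WordNet-style path measure over a semantic tree can be sketched as follows. The tiny hand-built hypernym tree here is purely illustrative (the paper uses the real WordNet ontology), and the `1 / (1 + distance)` formula is a common path-based measure, not necessarily the paper's exact one.

```python
# Toy hypernym tree standing in for WordNet (illustrative only;
# the paper uses the actual WordNet ontology).
HYPERNYMS = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
    "animal": "entity", "car": "vehicle", "vehicle": "entity",
}

def path_to_root(word):
    """Walk up the hypernym chain from a word to the root."""
    path = [word]
    while word in HYPERNYMS:
        word = HYPERNYMS[word]
        path.append(word)
    return path

def path_similarity(w1, w2):
    """1 / (1 + shortest path length through a common ancestor),
    a common WordNet-style path measure."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    common = set(p1) & set(p2)
    if not common:
        return 0.0
    dist = min(p1.index(a) + p2.index(a) for a in common)
    return 1.0 / (1.0 + dist)

print(path_similarity("dog", "wolf"))  # share the 'canine' ancestor: 1/3
print(path_similarity("dog", "car"))   # meet only at the root 'entity': 1/7
```

Because every noun and verb in WordNet ultimately connects to a common root, a measure of this shape almost always returns a nonzero relatedness, which is what lets the algorithm relate sentence pairs even when their surface words differ.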
Finally, the results demonstrate that the proposed method performs very well both on sentence similarity and on the task of paraphrase recognition.