For our term project, which requires measuring both syntactic and semantic word similarities, I needed an efficient algorithm. I chose this paper because it introduces two of the best model architectures for computing continuous vector representations of words from very large data sets and compares them to other well-known techniques based on different types of neural networks, which we will analyze during our work on the term project.
Word2vec is a particularly computationally efficient predictive model for learning word embeddings from raw text. It comes in two flavors: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. Algorithmically, the two models are similar, except that CBOW predicts the target word from its surrounding context words, while Skip-Gram does the inverse and predicts the context words from the target word (see the sketch below).
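To make the difference concrete, here is a minimal sketch using the gensim library (my choice for illustration; the paper's authors released their own original C implementation), where the sg flag switches between the two architectures. The toy corpus and parameter values are arbitrary assumptions for demonstration only.

```python
# Minimal sketch of the two word2vec flavors via gensim (illustrative only).
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

# CBOW: predict the target word from its surrounding context words (sg=0).
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

# Skip-Gram: predict the surrounding context words from the target word (sg=1).
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["king"][:5])       # first few dimensions of the CBOW vector
print(skipgram.wv["king"][:5])   # first few dimensions of the Skip-Gram vector
```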
According to Mikolov, the Skip-Gram model works well with small amounts of training data and represents even rare words well, whereas the CBOW model is several times faster to train and achieves slightly better accuracy for frequent words.
To maximize accuracy while minimizing computational complexity, defined as the number of parameters that need to be accessed to fully train the model and used to compare the different architectures, the authors developed new model architectures that preserve the linear regularities among words. They designed a new comprehensive test set for measuring both syntactic and semantic regularities, and they show that many such regularities can be learned with high accuracy, as illustrated by the analogy sketch below.
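The kind of linear regularity the paper's analogy test probes can be checked directly with vector arithmetic. The sketch below, again using gensim and assuming a model already trained on a sufficiently large corpus (the file name is hypothetical), asks for the word closest to vector("king") - vector("man") + vector("woman"), which the paper reports to be "queen".

```python
# Hypothetical sketch: assumes a Word2Vec model trained on a large corpus
# (a toy corpus will not exhibit this behaviour).
from gensim.models import Word2Vec

model = Word2Vec.load("word2vec_large_corpus.model")  # hypothetical path

# The analogy "man is to king as woman is to ?" becomes vector arithmetic:
# vector('king') - vector('man') + vector('woman') should lie close to 'queen'.
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.7...)] on a well-trained model
```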
Moreover, the paper discusses how training time and accuracy depend on the dimensionality of the word vectors and on the amount of training data; the short sketch below shows how the dimensionality enters as a parameter in practice.
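As a rough illustration of the cost side of that trade-off, this sketch trains the same CBOW model at two different vector dimensionalities and times each run; the corpus and parameter values are arbitrary, so only the relative timings are meaningful.

```python
# Rough illustration: train the same CBOW model at two dimensionalities
# and time each run. Corpus and parameter values are arbitrary.
import time
from gensim.models import Word2Vec

corpus = [["word%d" % i for i in range(j, j + 20)] for j in range(0, 2000, 5)]

for dim in (50, 300):
    start = time.time()
    Word2Vec(corpus, vector_size=dim, window=5, min_count=1, sg=0, epochs=5)
    print("vector_size=%d trained in %.2f s" % (dim, time.time() - start))
```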
Conference Paper: Efficient Estimation of Word Representations in Vector Space
Conference: Proceedings of the International Conference on Learning Representations (ICLR 2013)
By: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean