This post summarizes the paper "PhraseCTM: Correlated Topic Modeling on Phrases within Markov Random Fields".
Weijing Huang, Tengjiao Wang, Wei Chen, Siyuan Jiang, Kam-Fai Wong. PhraseCTM: Correlated Topic Modeling on Phrases within Markov Random Fields. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 521–526, Melbourne, Australia, July 15-20, 2018. Association for Computational Linguistics.
Summary:
PhraseCTM is a novel method for finding correlated topics at the phrase level.
The work is done in two stages:
- Training the model, using Markov Random Fields to link phrases and their component words when they are semantically coherent.
- Generating the correlation of topics from PhraseCTM, then evaluating the method with a quantitative experiment and a human study.
What is CTM a solution for?
A topic is easier to understand when topic modeling is done on phrases (“grounding conductor, grounding wire, aluminum wiring”) than when it is done on words (“ground, wire, use, power, cable, wires”), which carry no context. However, the phrase-level approach becomes complex when the size is large.
So CTM applies a correlation structure to figure out the correlated relationships between topics and to group similar topics together.
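To make "correlation structure" concrete, here is a minimal sketch. It assumes the standard logistic normal formulation of CTM, which this summary does not spell out: each document gets Gaussian topic scores that may be correlated, and a softmax turns them into topic proportions. The function names, the three-topic setup, and the value of `rho` are hypothetical choices for illustration, not taken from the paper.

```python
import math
import random


def sample_topic_proportions(rho, rng):
    """Draw proportions over 3 topics; rho is the assumed correlation
    between the Gaussian scores of topics 0 and 1 (topic 2 is independent)."""
    z0, z1, z2 = (rng.gauss(0.0, 1.0) for _ in range(3))
    eta = [z0, rho * z0 + math.sqrt(1.0 - rho * rho) * z1, z2]
    # Softmax maps the Gaussian scores onto the probability simplex.
    peak = max(eta)
    weights = [math.exp(e - peak) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def topic_pair_correlation(rho, n_docs=5000, seed=0):
    """Empirical correlation between the proportions of topics 0 and 1."""
    rng = random.Random(seed)
    thetas = [sample_topic_proportions(rho, rng) for _ in range(n_docs)]
    return pearson([t[0] for t in thetas], [t[1] for t in thetas])


corr_linked = topic_pair_correlation(rho=0.9)
corr_independent = topic_pair_correlation(rho=0.0)
print(corr_linked, corr_independent)
```

With a high `rho`, topics 0 and 1 tend to rise and fall together across documents, which is the kind of relationship that lets similar topics be grouped. With `rho=0.0` the two proportions are still negatively related, simply because all proportions must sum to one.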
When does CTM perform well?
Phrases are much fewer than words in each document, and CTM needs more contextual information to build a well-performing model. So it is not used with short documents.
How does PhraseCTM work?
1. Training PhraseCTM:
Transform each document into words and semantically coherent phrases.
Link phrases and their component words:
- Calculate the NPMI (normalized pointwise mutual information) metric.
- Define a threshold on that metric to decide whether to link.
- Double count each phrase as two parts: one as the phrase itself, the other as its component words.
- Model the generation of words and phrases simultaneously by linking the phrases and their component words within Markov Random Fields.
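The NPMI and threshold steps above can be sketched as follows. NPMI is the pointwise mutual information log(p(x,y) / (p(x) p(y))) divided by -log p(x,y), so it ranges from -1 to 1. Everything else here is an assumption made for illustration: the document-level co-occurrence estimate, the helper names `doc_npmi` and `linked_words`, the toy documents, and the threshold values. The paper's exact counting scheme and threshold are not given in this summary.

```python
import math


def npmi(p_x, p_y, p_xy):
    """Normalized PMI in [-1, 1]; -1 means x and y never co-occur."""
    if p_xy == 0.0:
        return -1.0
    if p_xy == 1.0:
        return 1.0  # x and y appear together in every document (avoids 0/0)
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


def doc_npmi(x, y, docs):
    """Estimate NPMI from document-level co-occurrence (an assumption)."""
    n = len(docs)
    p_x = sum(x in d for d in docs) / n
    p_y = sum(y in d for d in docs) / n
    p_xy = sum(x in d and y in d for d in docs) / n
    return npmi(p_x, p_y, p_xy)


def linked_words(phrase, docs, threshold):
    """Component words whose NPMI with the phrase reaches the threshold."""
    return [w for w in phrase.split() if doc_npmi(phrase, w, docs) >= threshold]


# Toy corpus. Each document is double counted: a phrase appears once as
# the phrase itself and once more as its component words.
docs = [
    {"grounding wire", "grounding", "wire", "cable"},
    {"grounding wire", "grounding", "wire", "power"},
    {"wire", "power", "cable"},
    {"food", "recipe"},
]

# Hypothetical thresholds; a stricter one keeps fewer links.
strict_links = linked_words("grounding wire", docs, threshold=0.5)
loose_links = linked_words("grounding wire", docs, threshold=0.3)
print(strict_links, loose_links)
```

In this toy corpus "grounding" only ever appears inside the phrase, so its score is at the maximum, while "wire" also appears on its own and scores lower. The threshold decides which of those links would be kept for the Markov Random Field.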
2. Generating the correlation of topics:
The method is evaluated on five datasets with a quantitative experiment and a human study.
Conclusion:
PhraseCTM helps find the topics of a corpus, and it has demonstrated high-quality phrase-level topics.