Topic modeling [classic models]
http://www.cs.princeton.edu/~blei/topicmodeling.html
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts.
Below, you will find links to introductory materials, corpus browsers based on topic models, and open source software (from my research group) for topic modeling.
Introductory materials
- I wrote a general introduction to topic modeling.
- John Lafferty and I wrote a more technical review paper about this field.
- Here are slides from some recent tutorials about topic modeling:
- Here is a video from a talk on dynamic and correlated topic models applied to the journal Science. (Here are the slides.)
- David Mimno maintains a bibliography of topic modeling papers and software.
- The topic models mailing list is a good forum for discussing topic modeling.
Corpus browsers based on topic models
The structure uncovered by topic models can be used to explore an otherwise unorganized collection. The following are browsers of large collections of documents, built with topic models.
- A 100-topic browser of the dynamic topic model fit to Science (1882-2001).
- A 100-topic browser of the correlated topic model fit to Science (1980-2000).
- A 50-topic browser of latent Dirichlet allocation fit to the 2006 arXiv.
- A 20-topic browser of latent Dirichlet allocation fit to The American Political Science Review.
Also see Sean Gerrish's discipline browser for an interesting application of topic modeling at JSTOR.
To build your own browsers, see Allison Chaney's excellent Topic Model Visualization Engine (TMVE). For example, here is a browser of 100,000 Wikipedia articles that uses TMVE.
Topic modeling software
Our research group has released many open-source software packages for topic modeling. Please post questions, comments, and suggestions about this code to the topic models mailing list.
Link | Model/Algorithm | Language | Author | Notes |
lda-c | Latent Dirichlet allocation | C | D. Blei | This implements variational inference for LDA. |
class-slda | Supervised topic models for classification | C++ | C. Wang | Implements supervised topic models with a categorical response. |
lda | R package for Gibbs sampling in many models | R | J. Chang | Implements many models and is fast. Supports LDA, RTMs (for networked documents), MMSB (for network data), and sLDA (with a continuous response). |
online lda | Online inference for LDA | Python | M. Hoffman | Fits topic models to massive data. The demo downloads random Wikipedia articles and fits a topic model to them. |
online hdp | Online inference for the HDP | Python | C. Wang | Fits hierarchical Dirichlet process topic models to massive data. The algorithm determines the number of topics. |
tmve (online) | Topic Model Visualization Engine | Python | A. Chaney | A package for creating corpus browsers. See, for example, Wikipedia. |
ctr | Collaborative modeling for recommendation | C++ | C. Wang | Implements variational inference for collaborative topic models. These models recommend items to users based on item content and other users' ratings. |
dtm | Dynamic topic models and the influence model | C++ | S. Gerrish | This implements topics that change over time and a model of how individual documents predict that change. |
hdp | Hierarchical Dirichlet processes | C++ | C. Wang | Topic models where the data determine the number of topics. This implements Gibbs sampling. |
ctm-c | Correlated topic models | C | D. Blei | This implements variational inference for the CTM. |
diln | Discrete infinite logistic normal | C | J. Paisley | This implements the discrete infinite logistic normal, a Bayesian nonparametric topic model that finds correlated topics. |
hlda | Hierarchical latent Dirichlet allocation | C | D. Blei | This implements a topic model that finds a hierarchy of topics. The structure of the hierarchy is determined by the data. |
turbotopics | Turbo topics | Python | D. Blei | Turbo topics find significant multiword phrases in topics. |
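Several packages above fit LDA by Gibbs sampling. As a rough illustration of that idea only, here is a minimal sketch of collapsed Gibbs sampling for LDA in plain Python. It is not the code of any package listed in the table; the function name `lda_gibbs`, its parameters, and the default hyperparameters `alpha` and `beta` are assumptions chosen for this example, and the toy corpus is made up.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch, hypothetical API).

    docs is a list of token lists. Returns (vocab, doc_topic, topic_word),
    where the last two are count tables taken from the final sample.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic

    # Start from a uniformly random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    topics = range(num_topics)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments
                # (unnormalized; random.choices normalizes the weights).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word


if __name__ == "__main__":
    # Made-up toy corpus, for illustration only.
    toy_docs = [
        ["gene", "dna", "cell", "gene"],
        ["star", "galaxy", "orbit", "star"],
        ["cell", "dna", "protein"],
    ]
    vocab, doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2)
    for k, row in enumerate(topic_word):
        top = sorted(range(len(vocab)), key=lambda v: -row[v])[:3]
        print("topic", k, [vocab[v] for v in top])
```

The sketch keeps only count tables and resamples one token at a time, which is easy to follow but slow; the released packages are the better choice for real corpora.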