In this post, I will show how to do topic modeling using Latent Dirichlet Allocation (LDA) in R and easily visualize the results using the LDAvis library.
This piece on topic modeling is based on: topic modeling using R.
For a theoretical overview of topic modeling, see http://videolectures.net/mlss09uk_blei_tm/
I searched online for a tutorial on LDA using the topicmodels and LDAvis packages in R, but to no avail, so I wrote my own (hope you like it!).
library(topicmodels)

ldaOut <- LDA(dtm, k = 4, method = "Gibbs",
              control = list(nstart = 5, seed = list(1, 2, 3, 4, 5), best = TRUE,
                             burnin = 4000, iter = 2000, thin = 500))
In the code above, dtm is the document-term matrix, k is the number of topics (which you must choose before training), and method = "Gibbs" selects Gibbs sampling, controlled by the following parameters:
- burnin: the number of initial Gibbs iterations to discard before the sampler is considered to have settled
- iter: the number of iterations to run after the burn-in
- thin: the thinning interval, i.e. the number of in-between iterations that are omitted, so only every 500th iteration after the burn-in is kept here
- nstart: the number of independent runs, each from a different random starting point
- seed: a list of random seeds, one per run, so its length must equal nstart
- best: if TRUE, only the run with the highest posterior likelihood is returned
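If you want to try the call end to end without preparing your own corpus, here is a minimal sketch. It assumes the AssociatedPress document-term matrix that ships with topicmodels as a stand-in for dtm; the 50-document subset and the much shorter sampler settings are my own choices to keep the run fast, not recommendations.

```r
library(topicmodels)

# Stand-in for dtm: the AssociatedPress DocumentTermMatrix bundled with topicmodels.
data("AssociatedPress", package = "topicmodels")
dtm_small <- AssociatedPress[1:50, ]   # 50 documents keeps the run short

# Same call shape as above, with far fewer iterations (illustrative values only).
fit <- LDA(dtm_small, k = 4, method = "Gibbs",
           control = list(nstart = 2, seed = list(1, 2), best = TRUE,
                          burnin = 100, iter = 200, thin = 50))

top_terms  <- terms(fit, 5)   # matrix with the five most probable terms per topic
doc_topics <- topics(fit)     # most likely topic for each document

print(top_terms)
print(table(doc_topics))
```

Because best = TRUE, fit is a single fitted model rather than a list of runs, so terms() and topics() can be called on it directly.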
Next, create a JSON object and pass it to the visualization template. You can consult the LDAvis documentation or follow the code below:
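library(LDAvis)

# Most likely topic per document (not used by createJSON, handy for inspection)
ldaOut.topics <- as.matrix(topics(ldaOut))
# phi: topic-term probabilities, one row per topic
ldaOut.terms <- as.matrix(posterior(ldaOut)$terms)
# Number of tokens per document and total count of each term
ldaOut.docLength <- rowSums(as.matrix(dtm))
ldaOut.termFreq <- colSums(as.matrix(dtm))
# theta: document-topic probabilities, one row per document
topicProbabilities <- as.data.frame(ldaOut@gamma)

# Visualize (R is the number of terms shown per topic)
jsonObj <- createJSON(phi = ldaOut.terms, theta = topicProbabilities,
                      vocab = colnames(ldaOut.terms),
                      doc.length = ldaOut.docLength,
                      term.frequency = ldaOut.termFreq, R = 10)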
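As far as I understand the createJSON documentation, phi should be a topics x terms matrix and theta a documents x topics matrix, with every row summing to one, and the lengths of vocab, term.frequency and doc.length have to line up with them. The sketch below is a small sanity check you could run first. Note that check_ldavis_inputs is a hypothetical helper of mine, not part of LDAvis, and the toy numbers are made up.

```r
# Hypothetical helper (not part of LDAvis): checks the shapes createJSON expects.
check_ldavis_inputs <- function(phi, theta, vocab, doc.length, term.frequency,
                                tol = 1e-6) {
  phi <- as.matrix(phi)
  theta <- as.matrix(theta)
  stopifnot(
    nrow(phi) == ncol(theta),                 # same number of topics in both
    ncol(phi) == length(vocab),               # one column of phi per term
    length(term.frequency) == length(vocab),  # one corpus count per term
    nrow(theta) == length(doc.length),        # one row of theta per document
    all(abs(rowSums(phi) - 1) < tol),         # each topic is a distribution over terms
    all(abs(rowSums(theta) - 1) < tol)        # each document is a distribution over topics
  )
  invisible(TRUE)
}

# Toy example: 2 topics, 3 terms, 2 documents (made-up numbers).
phi   <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
theta <- matrix(c(0.9, 0.1,
                  0.4, 0.6), nrow = 2, byrow = TRUE)
ok <- check_ldavis_inputs(phi, theta, vocab = c("a", "b", "c"),
                          doc.length = c(10, 12), term.frequency = c(9, 6, 7))
```

With the objects from this post, the call would be check_ldavis_inputs(ldaOut.terms, topicProbabilities, colnames(ldaOut.terms), ldaOut.docLength, ldaOut.termFreq).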
All that remains is to pass jsonObj to serVis (from LDAvis), or to renderVis if you’re embedding the visualization in a Shiny app.
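Here is a rough sketch of both options. To keep it self-contained, it assumes the TwentyNewsgroups example data bundled with LDAvis in place of jsonObj; the output folder and the output id "ldaVis" are arbitrary names I picked.

```r
library(LDAvis)

# Stand-in for jsonObj: example model output bundled with LDAvis.
data("TwentyNewsgroups", package = "LDAvis")
json <- with(TwentyNewsgroups,
             createJSON(phi = phi, theta = theta, doc.length = doc.length,
                        vocab = vocab, term.frequency = term.frequency, R = 10))

# Option 1: write the visualization files to a folder.
# Set open.browser = TRUE to view the result right away.
out_dir <- file.path(tempdir(), "ldavis")
serVis(json, out.dir = out_dir, open.browser = FALSE)

# Option 2: embed it in a Shiny app (only built if shiny is installed).
if (requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(visOutput("ldaVis"))
  server <- function(input, output) {
    output$ldaVis <- renderVis(json)
  }
  app <- shiny::shinyApp(ui, server)   # start it with shiny::runApp(app)
}
```

In your own script, replace json with the jsonObj created earlier.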