Classifying documents into topics using LDA

In this post, I will show how to do topic modeling using Latent Dirichlet Allocation (LDA) in R and easily visualize the results using the LDAvis library.

This piece on topic modeling is based on: topic modeling using R.

For a theoretical overview on topic modeling, see http://videolectures.net/mlss09uk_blei_tm/

I searched online for a tutorial on LDA using the topicmodels and LDAvis packages in R, but to no avail, so I wrote my own (hope you like it!).

First of all, you need a document-term matrix (e.g. the one created by the tm package). See this piece on how to do that. With that, training an LDA model is as simple as

library(topicmodels)  # provides LDA()
ldaOut <- LDA(dtm, k = 4, method = "Gibbs",
              control = list(nstart = 5, seed = list(1, 2, 3, 4, 5), best = TRUE,
                             burnin = 4000, iter = 2000, thin = 500))
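If you do not have a matrix at hand yet, here is a minimal sketch of how the dtm passed to LDA could be built with the tm package. The three example sentences and the particular cleaning steps are my own assumptions for illustration; a real corpus will usually need more careful preprocessing.

```r
library(tm)

# Toy corpus (made up for illustration)
docs <- c("Cats and dogs are popular household pets.",
          "Stock markets fell sharply after the announcement.",
          "The new vaccine trial reported promising results.")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop common stop words

# Rows are documents, columns are terms
dtm <- DocumentTermMatrix(corpus)

# Drop documents that ended up with no terms after cleaning
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]
```

The last step is there because a document with no remaining terms carries no information for the topic model.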

In the LDA call above, dtm is the document-term matrix, k is the number of topics (you have to decide this prior to model training), and I used the Gibbs sampling method with the following parameters:

• burnin: the number of initial Gibbs iterations to discard
• iter: the number of iterations to run after the burn-in
• thin: the thinning interval; only every thin-th iteration after the burn-in is kept for further use
• nstart: the number of independent runs, each from a different starting point
• seed: a list of random seeds, one for each of the nstart runs
• best = TRUE returns only the draw with the highest posterior likelihood over all runs.
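To make the interplay of burnin, iter and thin concrete, here is a small base-R sketch of which iterations would be kept under the settings used above. This is just arithmetic based on my reading of the parameters, not code taken from topicmodels, and the exact iteration indexing inside the package may differ by one.

```r
burnin <- 4000
iter   <- 2000
thin   <- 500
nstart <- 5

# Iterations retained in a single run: every thin-th one after the burn-in
kept <- seq(from = burnin + thin, to = burnin + iter, by = thin)
kept

# Candidate draws across all runs; best = TRUE returns only one of them
n_candidates <- length(kept) * nstart
n_candidates
```

So each run costs burnin + iter = 6000 iterations, but only a handful of them are ever considered as results.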

Afterwards, you have to create a JSON object and pass it to the visualization template. You can either see the documentation or follow the code below:

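library(LDAvis)  # provides createJSON(), serVis() and renderVis()

ldaOut.topics <- as.matrix(topics(ldaOut))            # most likely topic per document
ldaOut.terms <- as.matrix(posterior(ldaOut)$terms)    # topic-term probabilities (phi)
ldaOut.docLength <- rowSums(as.matrix(dtm))           # number of tokens per document
ldaOut.termFreq <- colSums(as.matrix(dtm))            # corpus-wide frequency of each term
topicProbabilities <- as.data.frame(ldaOut@gamma)     # document-topic probabilities (theta)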

# Build the JSON object that LDAvis renders (R = number of terms shown per topic)
jsonObj <- createJSON(phi = ldaOut.terms, theta = topicProbabilities,
                      vocab = colnames(ldaOut.terms), doc.length = ldaOut.docLength,
                      term.frequency = ldaOut.termFreq, R = 10)
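createJSON is picky about the shapes of its inputs: phi needs one row per topic and one column per term, theta needs one row per document and one column per topic, and the rows of both must sum to one. The helper below is a hypothetical convenience function of my own (it is not part of LDAvis) that checks these conditions up front, demonstrated on a made-up toy example.

```r
# Hypothetical helper (not part of LDAvis): checks the shapes createJSON() expects
check_ldavis_inputs <- function(phi, theta, vocab, doc.length, term.frequency) {
  phi <- as.matrix(phi)
  theta <- as.matrix(theta)
  stopifnot(
    nrow(phi) == ncol(theta),             # same number of topics in both
    ncol(phi) == length(vocab),           # one column of phi per term
    ncol(phi) == length(term.frequency),  # one frequency per term
    nrow(theta) == length(doc.length),    # one row of theta per document
    isTRUE(all.equal(rowSums(phi), rep(1, nrow(phi)), check.attributes = FALSE)),
    isTRUE(all.equal(rowSums(theta), rep(1, nrow(theta)), check.attributes = FALSE))
  )
  invisible(TRUE)
}

# Toy example: 2 topics, 3 terms, 2 documents (numbers made up)
phi <- matrix(c(0.5, 0.3, 0.2,
                0.1, 0.1, 0.8), nrow = 2, byrow = TRUE)
theta <- matrix(c(0.9, 0.1,
                  0.4, 0.6), nrow = 2, byrow = TRUE)
ok <- check_ldavis_inputs(phi, theta, vocab = c("a", "b", "c"),
                          doc.length = c(10, 12), term.frequency = c(7, 5, 10))
```

With the objects from this post, the call would be check_ldavis_inputs(ldaOut.terms, topicProbabilities, colnames(ldaOut.terms), ldaOut.docLength, ldaOut.termFreq).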

All that remains is to pass jsonObj to the function serVis (from LDAvis), or to renderVis if you’re embedding it in a Shiny app.
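Here is a minimal sketch of both options. The wrapper names show_lda and lda_app, the output folder "ldavis_output" and the output id "ldaPlot" are placeholders I made up; serVis, visOutput and renderVis come from LDAvis, the rest from shiny.

```r
# Option 1: write the visualization to a folder and open it in the browser
show_lda <- function(jsonObj, out.dir = "ldavis_output") {
  LDAvis::serVis(jsonObj, out.dir = out.dir, open.browser = interactive())
}

# Option 2: embed the visualization in a Shiny app
lda_app <- function(jsonObj) {
  ui <- shiny::fluidPage(LDAvis::visOutput("ldaPlot"))
  server <- function(input, output, session) {
    output$ldaPlot <- LDAvis::renderVis(jsonObj)
  }
  shiny::shinyApp(ui, server)
}

# Usage, assuming jsonObj was created with createJSON() as above:
# show_lda(jsonObj)
# lda_app(jsonObj)
```

I wrapped both in functions so that nothing is launched until you call them with your own jsonObj.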