Classifying documents into topics using LDA

In this post, I will show how to do topic modeling using Latent Dirichlet Allocation (LDA) in R with the topicmodels package, and how to visualize the results using the LDAvis package.

This piece on topic modeling is based on: topic modeling using R.

For a theoretical overview of topic modeling, see http://videolectures.net/mlss09uk_blei_tm/

I searched online for a tutorial on LDA using the topicmodels and LDAvis packages in R, but to no avail. So here I write my own (hope you like it!)
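Since the model is fitted on a document-term matrix, here is a minimal sketch of building one with tm. The four documents are made up for illustration, and the preprocessing options are just one reasonable choice, not a recommendation.

```r
library(tm)

# Toy corpus (made-up documents, for illustration only)
docs <- c("Cats and dogs are popular pets",
          "Stock markets fell sharply today",
          "Dogs need daily exercise",
          "Investors watched the markets closely")

corpus <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms
dtm <- DocumentTermMatrix(corpus,
  control = list(tolower = TRUE, removePunctuation = TRUE, stopwords = TRUE))

# Drop any document that ended up with no terms after preprocessing,
# since LDA cannot handle empty documents
dtm <- dtm[slam::row_sums(dtm) > 0, ]
```

On a real corpus you would replace docs with your own character vector (or read the files with one of tm's sources).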

First of all, you need a document-term matrix, with documents as rows and terms as columns (e.g. the one created by the tm package). See this piece on how to do that. With that, training an LDA model is as simple as

library(topicmodels)
ldaOut <- LDA(dtm, k=4, method="Gibbs",
   control=list(nstart=5, seed=list(1, 2, 3, 4, 5), best=TRUE,
                burnin=4000, iter=2000, thin=500))

In the code block above, dtm is the document-term matrix, k is the number of topics (you have to decide this prior to model training), and I used Gibbs sampling with the following control parameters:

  • burnin: the number of initial Gibbs iterations to discard
  • iter: the number of Gibbs iterations to run after the burn-in
  • thin: the number of iterations skipped between retained draws (here, every 500th iteration is kept)
  • nstart: the number of independent runs, each from a different starting point
  • seed: a list of random seeds, one for each of the nstart runs
  • best = TRUE returns only the run with the highest posterior likelihood.
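To get a feel for these settings without waiting on a long run, here is a small sketch on the AssociatedPress data that ships with topicmodels. The subset size, k = 3 and the short burn-in are my own choices to keep it fast; they are far too small for a real analysis.

```r
library(topicmodels)

# Example document-term matrix bundled with topicmodels
data("AssociatedPress", package = "topicmodels")
smallDtm <- AssociatedPress[1:50, ]   # first 50 documents only, to keep it quick

# Same kind of call as above, with deliberately tiny settings
fit <- LDA(smallDtm, k = 3, method = "Gibbs",
           control = list(nstart = 2, seed = list(1, 2), best = TRUE,
                          burnin = 100, iter = 200, thin = 100))

topTerms <- terms(fit, 5)    # 5 most probable terms for each topic
docTopics <- topics(fit)     # most likely topic for each document

print(topTerms)
print(table(docTopics))      # how many documents fall under each topic
```

Note that seed has exactly as many entries as nstart, which is what the control list expects.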

Afterwards, you have to create a JSON object and pass it to the visualization template. You can either consult the LDAvis documentation or follow along with this code:

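library(LDAvis)

# Inputs for LDAvis, all derived from the fitted model and the dtm
ldaOut.topics <- as.matrix(topics(ldaOut))           # most likely topic per document
ldaOut.terms <- as.matrix(posterior(ldaOut)$terms)   # topics x terms (phi)
ldaOut.docLength <- rowSums(as.matrix(dtm))          # number of tokens per document
ldaOut.termFreq <- colSums(as.matrix(dtm))           # corpus-wide count per term
topicProbabilities <- ldaOut@gamma                   # documents x topics (theta)

# Visualize (R is the number of terms displayed per topic)
jsonObj <- createJSON(phi=ldaOut.terms, theta=topicProbabilities,
   vocab=colnames(ldaOut.terms), doc.length = ldaOut.docLength,
   term.frequency = ldaOut.termFreq, R=10)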
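If createJSON complains about its inputs, it is usually because the pieces do not line up: phi should be topics x terms, theta documents x topics, and each row of both should sum to 1. Here is a sketch of a sanity check in base R. The helper check_ldavis_inputs and its tolerance are my own invention, not part of LDAvis, and the toy numbers are made up.

```r
# Hypothetical helper (not part of LDAvis): checks that the inputs line up.
check_ldavis_inputs <- function(phi, theta, vocab, doc.length, term.frequency,
                                tol = 1e-6) {
  phi <- as.matrix(phi)
  theta <- as.matrix(theta)
  stopifnot(
    nrow(phi) == ncol(theta),                 # one row of phi per topic
    ncol(phi) == length(vocab),               # one column of phi per term
    nrow(theta) == length(doc.length),        # one row of theta per document
    length(term.frequency) == length(vocab),  # one frequency per term
    all(abs(rowSums(phi) - 1) < tol),         # each topic is a distribution over terms
    all(abs(rowSums(theta) - 1) < tol)        # each document is a distribution over topics
  )
  invisible(TRUE)
}

# Toy inputs: 2 topics, 3 terms, 2 documents
phi <- matrix(c(0.5, 0.3, 0.2,
                0.1, 0.1, 0.8), nrow = 2, byrow = TRUE)
theta <- matrix(c(0.9, 0.1,
                  0.4, 0.6), nrow = 2, byrow = TRUE)
ok <- check_ldavis_inputs(phi, theta, vocab = c("a", "b", "c"),
                          doc.length = c(10, 12),
                          term.frequency = c(7, 5, 10))
```

With the objects from this post, the call would be check_ldavis_inputs(ldaOut.terms, topicProbabilities, colnames(ldaOut.terms), ldaOut.docLength, ldaOut.termFreq).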

All that remains is to pass jsonObj to the function serVis (from LDAvis), or to renderVis if you're embedding it in a Shiny app.
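Both routes can be sketched as small wrappers. The function names save_lda_vis and lda_vis_app and the output id "ldaVis" are hypothetical names of mine; serVis, visOutput and renderVis come from LDAvis, and the Shiny part assumes the shiny package is installed.

```r
library(LDAvis)

# Hypothetical wrapper: write the visualization files to out_dir
# without opening a browser. `json` is the string returned by createJSON().
save_lda_vis <- function(json, out_dir) {
  serVis(json, out.dir = out_dir, open.browser = FALSE)
  invisible(out_dir)
}

# Hypothetical wrapper: build a minimal Shiny app around the same JSON.
# The app object is only created here; nothing runs until it is launched.
lda_vis_app <- function(json) {
  ui <- shiny::fluidPage(visOutput("ldaVis"))
  server <- function(input, output, session) {
    output$ldaVis <- renderVis(json)
  }
  shiny::shinyApp(ui, server)
}
```

For a quick look, serVis(jsonObj) on its own is enough in an interactive session. The wrappers would be used as save_lda_vis(jsonObj, "ldavis-output") or shiny::runApp(lda_vis_app(jsonObj)).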
