Topic Modeling: Avalanches of Words, Sifted and Sorted

Posted on March 25, 2012 by

From The NY Times:

David M. Blei of Princeton University is among those who are teaching computers to sift through the digital pages of books and articles and categorize the contents by subject, even when that subject isn’t stated explicitly.

For decades, of course, librarians and many others have labeled books and documents with keywords. “But human categorization can only go so far,” said Dr. Blei, an associate professor in computer science. “We don’t have the human power to read and tag all this information.”

To cope with the information explosion, Dr. Blei and other researchers write algorithms so that computers can sift through millions of works and find their common themes by sorting related words into categories. It’s a field called probabilistic topic modeling.
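
To make "sorting related words into categories" concrete, here is a minimal sketch of one probabilistic topic model, written as a toy collapsed Gibbs sampler in the style of latent Dirichlet allocation. The article does not say which algorithm the researchers use, so the choice of method is an assumption. The function names (fit_topics, top_words), the parameters (alpha, beta, iterations) and the four tiny documents are all hypothetical, chosen only for illustration.

```python
import random
from collections import defaultdict


def fit_topics(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for an LDA-style topic model (illustrative only).

    docs is a list of token lists. Returns (topic_word, doc_topic), where
    topic_word[k][w] counts how often word w is assigned to topic k and
    doc_topic[d][k] counts how many tokens of document d are assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight = (how much doc d likes topic k) * (how much topic k likes word w).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Return up to n most frequently assigned words for each topic."""
    result = []
    for counts in topic_word:
        ranked = sorted(counts, key=lambda w: (-counts[w], w))
        result.append([w for w in ranked if counts[w] > 0][:n])
    return result


# Hypothetical miniature corpus: two documents about genetics, two about astronomy.
docs = [
    "gene dna genome gene sequence".split(),
    "dna sequence genome cell gene".split(),
    "star galaxy telescope star orbit".split(),
    "galaxy orbit telescope planet star".split(),
]
topic_word, doc_topic = fit_topics(docs, num_topics=2, iterations=100, seed=1)
print(top_words(topic_word))
```

No subject labels are supplied anywhere: the sampler only sees which words occur together in which documents, which is the sense in which a subject can be found "even when that subject isn't stated explicitly." The output of a sampler like this varies with the random seed, and real systems work on millions of documents rather than four.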


The Bookworm-arXiv interface is the latest in a series of tools developed by the Cultural Observatory. Late in 2010, in collaboration with Google, the lab released the Google n-gram viewer, which lets people search for a phrase of up to five words in Google’s database of scanned books and see the frequency of the words over time in a graph, Dr. Aiden said.
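
The kind of query described here, the frequency of a phrase of up to five words plotted over time, can be sketched in a few lines. This is only an illustration of the idea and not Google's implementation: the function names (ngrams, phrase_frequency_by_year), the whitespace tokenization, the choice to normalize by the number of same-length n-grams per year, and the two-entry corpus are all assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def phrase_frequency_by_year(corpus, phrase):
    """Relative frequency of a phrase per year (illustrative sketch).

    corpus is an iterable of (year, text) pairs. The result maps each year to
    the share of that year's n-grams (of the phrase's length) equal to the phrase.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    if not 1 <= n <= 5:
        raise ValueError("phrase must contain between one and five words")

    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += sum(1 for g in grams if g == target)

    # Skip years whose texts are too short to contain any n-gram of this length.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical two-year corpus, for illustration only.
corpus = [
    (1990, "topic models sort words"),
    (2000, "topic models find topic models"),
]
frequencies = phrase_frequency_by_year(corpus, "topic models")
print(frequencies)
```

Plotting the returned values against the years gives the kind of frequency-over-time graph the article describes.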
