models.wrappers.ldamallet – Latent Dirichlet Allocation via Mallet

Python wrapper for Latent Dirichlet Allocation (LDA) from MALLET, the Java topic modelling toolkit [1].

This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents, using an (optimized version of) collapsed Gibbs sampling from MALLET.

MALLET’s LDA training requires O(#corpus_words) of memory, keeping the entire corpus in RAM. If you find yourself running out of memory, either decrease the workers constructor parameter, or use LdaModel which needs only O(1) memory.

The wrapped model can NOT be updated with new documents for online training – use gensim’s LdaModel for that.

Example:

>>> model = gensim.models.wrappers.LdaMallet('/Users/kofola/mallet-2.0.7/bin/mallet', corpus=my_corpus, num_topics=20, id2word=dictionary)
>>> print(model[my_vector])  # print LDA topics of a document
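Inference on new, unseen documents works the same way; a minimal sketch, assuming dictionary is the same gensim Dictionary used to build the training corpus:

>>> unseen_doc = "human computer interaction".split()
>>> bow = dictionary.doc2bow(unseen_doc)  # convert tokens to gensim's sparse bag-of-words format
>>> print(model[bow])  # sparse list of (topic_id, probability) pairs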
[1] http://mallet.cs.umass.edu/
class gensim.models.wrappers.ldamallet.LdaMallet(mallet_path, corpus=None, num_topics=100, alpha=50, id2word=None, workers=4, prefix=None, optimize_interval=0, iterations=1000, topic_threshold=0.0)

Bases: gensim.utils.SaveLoad

Class for LDA training using MALLET. Communication between MALLET and Python takes place by passing around data files on disk and calling Java with subprocess.call().

mallet_path is path to the mallet executable, e.g. /home/kofola/mallet-2.0.7/bin/mallet.

corpus is a gensim corpus, aka a stream of sparse document vectors.

id2word is a mapping between token ids and tokens.

workers is the number of threads to use for parallel training.

prefix is the string prefix under which all data files will be stored; default: system temp + random filename prefix.

optimize_interval re-optimizes the hyperparameters every optimize_interval iterations (this sometimes leads to a Java exception; set to 0 to switch hyperparameter optimization off).

iterations is the number of sampling iterations.

topic_threshold is the minimum probability a topic must have to be included in a document's topic distribution; this keeps the returned distributions sparse.
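A minimal instantiation sketch pulling these parameters together (the executable path and the corpus/dictionary variables are placeholders):

>>> from gensim.models.wrappers import LdaMallet
>>> model = LdaMallet(
...     '/path/to/mallet-2.0.7/bin/mallet',  # path to the mallet executable
...     corpus=my_corpus,           # gensim corpus: a stream of sparse document vectors
...     id2word=dictionary,         # mapping between token ids and tokens
...     num_topics=20,
...     workers=4,                  # parallel sampling threads
...     optimize_interval=10,       # re-optimize hyperparameters every 10 iterations
...     iterations=1000)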

convert_input(corpus, infer=False, serialize_corpus=True)

Serialize documents (lists of unicode tokens) to a temporary text file, then convert that text file to a file in MALLET's own format.

corpus2mallet(corpus, file_like)

Write out corpus in a file format that MALLET understands: one document per line:

document id[SPACE]label (not used)[SPACE]whitespace delimited utf8-encoded tokens[NEWLINE]
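A simplified, hypothetical sketch of a writer producing that line format, assuming each document is already a list of unicode token strings:

>>> def write_mallet_format(docs, file_like):
...     # one document per line: "<doc_id> <label> <token> <token> ..."; the label is unused
...     for doc_id, tokens in enumerate(docs):
...         line = u"%s 0 %s\n" % (doc_id, u" ".join(tokens))
...         file_like.write(line.encode("utf8"))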
fcorpusmallet()
fcorpustxt()
fdoctopics()
finferencer()
fstate()
ftopickeys()
fwordweights()

Helper methods returning the paths (under prefix) of the corresponding data files read and written during training.
get_version(direc_path)

Return the version of the MALLET installation found at direc_path.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), mmap must be left at None; load will raise an IOError otherwise.
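For example, re-loading a previously saved model (file name hypothetical):

>>> model = gensim.models.wrappers.LdaMallet.load('/tmp/mallet_lda.model')
>>> # or, with any separately stored large arrays memory-mapped read-only:
>>> model = gensim.models.wrappers.LdaMallet.load('/tmp/mallet_lda.model', mmap='r')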

load_document_topics()

Return an iterator over the topic distributions of the training corpus, by reading the doctopics.txt file generated during training.
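For instance, to walk the stored distributions of the training documents:

>>> for doc_no, topics in enumerate(model.load_document_topics()):
...     print(doc_no, topics)  # topics is a sparse list of (topic_id, probability) pairs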

load_word_topics()
print_topic(topicid, topn=10)
print_topics(num_topics=10, num_words=10)
read_doctopics(fname, eps=1e-06, renorm=True)

Yield document topic vectors from MALLET’s “doc-topics” format, as sparse gensim vectors.
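For example, to re-read the file written during training, keeping only topics above 1% probability and skipping renormalization (fdoctopics() returns that file's path):

>>> for topics in model.read_doctopics(model.fdoctopics(), eps=0.01, renorm=False):
...     print(topics)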

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.
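A minimal example (file name hypothetical):

>>> model.save('/tmp/mallet_lda.model')  # large arrays auto-detected and stored separately
>>> model.save('/tmp/mallet_lda.model', separately=[])  # force everything into the single pickle file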

show_topic(topicid, topn=10)
show_topics(num_topics=10, num_words=10, log=False, formatted=True)

Return the num_words most probable words for each of num_topics topics. Set num_topics=-1 to show all topics.

Set formatted=True to return each topic as a formatted string, or formatted=False to return each topic as a list of (weight, word) pairs.
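For example:

>>> for topic in model.show_topics(num_topics=5, num_words=10, formatted=False):
...     print(topic)  # a list of (weight, word) pairs per topic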

train(corpus)

Train MALLET LDA on corpus; invoked automatically by the constructor when a corpus is supplied.