
models.hdpmodel – Hierarchical Dirichlet Process

This module encapsulates functionality for the online Hierarchical Dirichlet Process algorithm.

It allows both model estimation from a training corpus and inference of topic distribution on new, unseen documents.

The core estimation code is directly adapted from the onlinehdp.py script by C. Wang; see Wang, Paisley, Blei: Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011).

http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf

The algorithm:

  • is streamed: training documents come in sequentially, no random access is required,
  • runs in constant memory w.r.t. the number of documents: the size of the training corpus does not affect memory footprint.
class gensim.models.hdpmodel.HdpModel(corpus, id2word, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, outputdir=None)

Bases: gensim.interfaces.TransformationABC

The constructor estimates Hierarchical Dirichlet Process model parameters based on a training corpus:

>>> hdp = HdpModel(corpus, id2word)
>>> hdp.print_topics(num_topics=20, num_words=10)

Inference on new documents is based on the approximately LDA-equivalent topics.

Model persistency is achieved through its load/save methods.

gamma: first level concentration
alpha: second level concentration
eta: the topic Dirichlet
T: top level truncation level
K: second level truncation level
kappa: learning rate
tau: slow down parameter
max_time: stop training after this many seconds
max_chunks: stop after having processed this many chunks (wrap around corpus beginning in another corpus pass, if there are not enough chunks in the corpus)

doc_e_step(doc, ss, Elogsticks_1st, word_list, unique_words, doc_word_ids, doc_word_counts, var_converge)

E step for a single document.

evaluate_test_corpus(corpus)
hdp_to_lda()

Compute an LDA model approximately equivalent to the current HDP model.

inference(chunk)
load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. load will raise an IOError if this condition is encountered.

optimal_ordering()

Reorder the topics (by decreasing weight).

print_topics(num_topics=20, num_words=20)

Alias for show_topics() that prints the num_words most probable words for num_topics topics. Set num_topics=-1 to print all topics.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap’ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names to not serialize (file handles, caches etc). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.

save_options()

Legacy method; use self.save() instead.

save_topics(doc_count=None)

Legacy method; use self.save() instead.

show_topics(num_topics=20, num_words=20, log=False, formatted=True)

Print the num_words most probable words for num_topics topics. Set num_topics=-1 to print all topics.

Set formatted=True to return the topics as a list of strings, or formatted=False to return them as lists of (weight, word) pairs.

update(corpus)
update_chunk(chunk, update=True, opt_o=True)
update_expectations()

Since we’re doing lazy updates on lambda, at any given moment the current state of lambda may not be accurate. This function updates all of the elements of lambda and Elogbeta so that if (for example) we want to print out the topics we’ve learned we’ll get the correct behavior.

update_finished(start_time, chunks_processed, docs_processed)
update_lambda(sstats, word_list, opt_o)
class gensim.models.hdpmodel.HdpTopicFormatter(dictionary=None, topic_data=None, topic_file=None, style=None)

Bases: object

STYLE_GENSIM = 1
STYLE_PRETTY = 2
format_topic(topic_id, topic_terms)
print_topics(num_topics=10, num_words=10)
show_topic_terms(topic_data, num_words)
show_topics(num_topics=10, num_words=10, log=False, formatted=True)
class gensim.models.hdpmodel.SuffStats(T, Wt, Dt)

Bases: object

set_zero()
gensim.models.hdpmodel.dirichlet_expectation(alpha)

For a vector theta ~ Dir(alpha), compute E[log(theta)] given alpha.
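The identity behind this helper is E[log theta_i] = psi(alpha_i) - psi(sum_j alpha_j), where psi is the digamma function. A minimal NumPy/SciPy re-implementation for illustration (the function name mirrors the module's, but this is a sketch, not the library code):

```python
import numpy as np
from scipy.special import psi  # digamma function

def dirichlet_expectation(alpha):
    """E[log theta] for theta ~ Dir(alpha), 1-D case."""
    return psi(alpha) - psi(np.sum(alpha))

alpha = np.array([1.0, 2.0, 3.0])
elog = dirichlet_expectation(alpha)

# Sanity check against a Monte Carlo estimate of E[log theta]:
rng = np.random.default_rng(0)
mc = np.log(rng.dirichlet(alpha, size=200_000)).mean(axis=0)
```

Each component of E[log theta] is negative, since every theta_i lies strictly inside (0, 1).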

gensim.models.hdpmodel.expect_log_sticks(sticks)

For a stick-breaking HDP, return E[log(sticks)].
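A sketch of the computation, consistent with the onlinehdp code this module adapts: sticks is assumed to be a 2 x (T-1) array of Beta variational parameters (a_k in row 0, b_k in row 1), and the expected log weight of stick k is E[log v_k] + sum over j &lt; k of E[log(1 - v_j)]:

```python
import numpy as np
from scipy.special import psi  # digamma function

def expect_log_sticks(sticks):
    """E[log pi] under stick-breaking with Beta(a_k, b_k) variational factors."""
    dig_sum = psi(np.sum(sticks, 0))
    Elog_v = psi(sticks[0]) - dig_sum      # E[log v_k]
    Elog_1mv = psi(sticks[1]) - dig_sum    # E[log (1 - v_k)]
    T = sticks.shape[1] + 1
    Elogsticks = np.zeros(T)
    Elogsticks[:T - 1] = Elog_v
    Elogsticks[1:] += np.cumsum(Elog_1mv)  # accumulate the remaining-stick terms
    return Elogsticks

# With a_k = b_k = 1 (uniform Beta), E[log v_k] = psi(1) - psi(2) = -1 exactly,
# so the expected log weights decrease by 1 per stick.
res = expect_log_sticks(np.ones((2, 3)))
```

The last entry collects only the E[log(1 - v_j)] terms, since the final stick takes all remaining mass.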

gensim.models.hdpmodel.lda_e_step(doc_word_ids, doc_word_counts, alpha, beta, max_iter=100)
gensim.models.hdpmodel.log_normalize(v)