models.logentropy_model – LogEntropy model

class gensim.models.logentropy_model.LogEntropyModel(corpus, id2word=None, normalize=True)

Bases: gensim.interfaces.TransformationABC

Objects of this class realize the transformation of a word-document co-occurrence matrix (integers) into a locally/globally weighted matrix (positive floats).

This is done by a log entropy normalization, optionally normalizing the resulting documents to unit length. The following formulas explain how to compute the log entropy weight for term i in document j:

local_weight_{i,j} = log(frequency_{i,j} + 1)

P_{i,j} = frequency_{i,j} / sum_j frequency_{i,j}

                      sum_j P_{i,j} * log(P_{i,j})
global_weight_i = 1 + ----------------------------
                      log(number_of_documents + 1)

final_weight_{i,j} = local_weight_{i,j} * global_weight_i
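As a sanity check, the formulas above can be computed by hand for a toy corpus. The sketch below is pure Python and independent of gensim; the variable names mirror the formulas, and the dense `frequency` matrix is just an illustration (gensim itself works on sparse bag-of-words corpora):

```python
import math

# Toy corpus: frequency[j][i] = raw count of term i in document j.
frequency = [
    [2, 0, 1],  # document 0
    [0, 3, 1],  # document 1
]
num_docs = len(frequency)
num_terms = len(frequency[0])

# global_weight_i = 1 + sum_j P_{i,j} * log(P_{i,j}) / log(number_of_documents + 1)
global_weight = []
for i in range(num_terms):
    total = sum(frequency[j][i] for j in range(num_docs))
    entropy = 0.0
    for j in range(num_docs):
        if frequency[j][i] > 0:
            p = frequency[j][i] / total  # P_{i,j}
            entropy += p * math.log(p)
    global_weight.append(1 + entropy / math.log(num_docs + 1))

# final_weight_{i,j} = log(frequency_{i,j} + 1) * global_weight_i
final_weight = [
    [math.log(frequency[j][i] + 1) * global_weight[i] for i in range(num_terms)]
    for j in range(num_docs)
]
```

Note how the global weight penalizes terms that are spread evenly across documents: term 2 occurs once in each document, so its entropy is maximal and its global weight drops well below 1, while terms 0 and 1 each occur in a single document and keep a global weight of exactly 1.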

The main methods are:

  1. the constructor, which calculates the global weighting for all terms in a corpus;
  2. the [] method, which transforms a simple count representation into the log entropy normalized space.

>>> log_ent = LogEntropyModel(corpus)
>>> print(log_ent[some_doc])
>>> log_ent.save('/tmp/foo.log_ent_model')

Model persistence is achieved via its load/save methods.

The normalize parameter dictates whether the resulting vectors will be scaled to unit length.
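With normalize=True, each transformed document is scaled to unit Euclidean length. A minimal sketch of that scaling (pure Python; the sparse (term_id, weight) document format mirrors gensim's bag-of-words convention, and `unit_length` is a hypothetical helper for illustration, not part of the API):

```python
import math

def unit_length(doc):
    """Scale a sparse (term_id, weight) document to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for _, w in doc))
    if norm == 0.0:
        return doc  # empty or all-zero document: leave untouched
    return [(term_id, w / norm) for term_id, w in doc]

doc = [(0, 3.0), (2, 4.0)]   # L2 norm is 5.0
normalized = unit_length(doc)
```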

initialize(corpus)

Initialize internal statistics based on a training corpus. Called automatically from the constructor.

load(fname, mmap=None)

Load a previously saved object from file (also see save).

If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. Default: don't use mmap, load large arrays as normal objects.

If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set. Load will raise an IOError if this condition is encountered.

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset([]), pickle_protocol=2)

Save the object to file (also see load).

fname_or_handle is either a string specifying the file name to save to, or an open file-like object which can be written to. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

If separately is None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This avoids pickle memory errors and allows mmap'ing large arrays back on load efficiently.

You can also set separately manually, in which case it must be a list of attribute names to be stored in separate files. The automatic check is not performed in this case.

ignore is a set of attribute names not to serialize (file handles, caches, etc.). On subsequent load() these attributes will be set to None.

pickle_protocol defaults to 2 so the pickled object can be imported in both Python 2 and 3.